Opening a file in Ruby

Code sample 1:
def count_lines1(file_name)
  open(file_name) do |file|
    count = 0
    while file.gets
      count += 1
    end
    count
  end
end
Code sample 2:
def count_lines2(file_name)
  file = open(file_name)
  count = 0
  while file.gets
    count += 1
  end
  count
end
I am wondering which is the better way to implement counting the lines in a file, in terms of good Ruby style.

which is the better way to implement the counting of lines in a file.
Neither. Ruby can do it easily using foreach:
def count_lines(file_name)
  lines = 0
  File.foreach(file_name) { lines += 1 }
  lines
end
If I run that against my ~/.bashrc:
$ ruby test.rb
37
foreach is very fast and will avoid scalability problems.
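As a small aside (not from the original answer): File.foreach without a block returns an Enumerator, so if all you need is the number, the whole method body can be just

  File.foreach(file_name).count

which still reads the file line by line rather than slurping it.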
Alternatively, you could take advantage of tools in the OS, such as wc -l, which was written specifically for the task:
`wc -l .bashrc`.to_i
which will return 37 again. If the file is huge, wc will likely outrun doing it in Ruby because wc is written in compiled code.
You can also read in large chunks with read and count newline characters.
Yes, read will allow you to do that, but the scalability issue will remain. In my environment read or readlines can be a script killer because we often have to process files well into the tens of GB. There's plenty of RAM to hold the data, but the I/O suffers because of the overhead of slurping the data. "Why is "slurping" a file not a good practice?" goes into this.
An alternate way of reading in big chunks is to tell Ruby to read a set block size, count the line endings in that block, and loop until the file has been read completely. I didn't test that method in the answer linked above, but in the past I did something similar when writing Perl and found it didn't really improve things; it just meant a bit more code. At that point, if all I was doing was counting lines, it would make more sense to call wc -l and let it do the work, since that would be faster in coding time and most likely in execution time too.
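For completeness, here is a rough sketch of that block-read approach (my illustration, not code from the answer above; the 1 MB block size is an arbitrary choice, not a tuned value):

def count_lines_chunked(file_name, block_size = 1024 * 1024)
  count = 0
  File.open(file_name, 'rb') do |file|
    # read fixed-size chunks and count the newline characters in each one
    while (chunk = file.read(block_size))
      count += chunk.count("\n")
    end
  end
  # note: a final line without a trailing newline is not counted here
  count
end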

Related

Processing a large number of CSV files

I'll try to expand on the title of my question. I work on a Ruby project where I have to process a large amount of data (around 120,000) stored in CSV files. I have to read this data, process it, and put it in a database. Right now it takes a couple of days, and I have to make it much faster. The problem is that sometimes anomalies appear during processing and I have to repeat the whole import. I decided that improving performance is more important than hunting for the bug with a small amount of data. For now I am sticking with CSV files. I decided to benchmark the processing script to find the bottlenecks and improve loading data from CSV. I see the following steps:
Benchmark and fix the most problematic bottlenecks
Maybe split CSV loading from processing. For example, create a separate table and load the data there; in the next step, load that data, process it, and put it in the right table.
Introduce threads to load data from CSV
For now I use the standard Ruby CSV library. Do you recommend a better gem?
If any of you are familiar with a similar problem, I would be happy to hear your opinion.
Edit:
Database: PostgreSQL
System: Linux
I haven't had the opportunity to test it myself, but I recently came across this article, which seems to do the job:
https://infinum.co/the-capsized-eight/articles/how-to-efficiently-process-large-excel-files-using-ruby
You'll have to adapt it to use CSV instead of XLSX.
For future reference, in case the article disappears, here is the code.
It works by writing BATCH_IMPORT_SIZE records to the database at a time, which should give a big performance boost.
class ExcelDataParser
  def initialize(file_path)
    @file_path = file_path
    @records = []
    @counter = 1
  end

  BATCH_IMPORT_SIZE = 1000

  def call
    rows.each do |row|
      increment_counter
      records << build_new_record(row)
      import_records if reached_batch_import_size? || reached_end_of_file?
    end
  end

  private

  attr_reader :file_path, :records
  attr_accessor :counter

  def book
    @book ||= Creek::Book.new(file_path)
  end

  # in this example, we assume that the
  # content is in the first Excel sheet
  def rows
    @rows ||= book.sheets.first.rows
  end

  def increment_counter
    self.counter += 1
  end

  def row_count
    @row_count ||= rows.count
  end

  def build_new_record(row)
    # only build a new record without saving it
    RecordModel.new(...)
  end

  def import_records
    # save multiple records using the activerecord-import gem
    RecordModel.import(records)
    # clear the records array
    records.clear
  end

  def reached_batch_import_size?
    (counter % BATCH_IMPORT_SIZE).zero?
  end

  def reached_end_of_file?
    counter == row_count
  end
end
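One way the CSV adaptation might look, as an untested sketch only (RecordModel is the same placeholder as above, and the attribute handling will depend on your columns): the Creek rows are swapped for CSV.foreach, and the end-of-file bookkeeping goes away because the final partial batch is simply flushed after the loop.

require 'csv'

class CsvDataParser
  BATCH_IMPORT_SIZE = 1000

  def initialize(file_path)
    @file_path = file_path
    @records = []
  end

  def call
    CSV.foreach(@file_path, headers: true) do |row|
      @records << RecordModel.new(row.to_h)        # adapt to your columns
      import_records if @records.size >= BATCH_IMPORT_SIZE
    end
    import_records unless @records.empty?          # flush the last partial batch
  end

  private

  def import_records
    RecordModel.import(@records)  # activerecord-import, as in the original class
    @records.clear
  end
end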
Make sure your Ruby script can process multiple files given as parameters, i.e. that you can run it like this:
script.rb abc.csv def.csv xyz.csv
Then parallelise it using GNU Parallel like this to keep all your CPU cores busy:
find . -name \*.csv -print0 | parallel -0 -X script.rb
The -X passes as many CSV files as possible to your Ruby job without exceeding the maximum length of the command line. You can add -j 8 after parallel if you want GNU Parallel to run, say, 8 jobs at a time, and you can use --eta to get the estimated finish time:
find . -name \*.csv -print0 | parallel -0 -X -j 8 --eta script.rb
By default, GNU Parallel will run as many jobs in parallel as you have CPU cores.
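The Ruby side only needs to loop over ARGV; a minimal sketch (the actual row processing is whatever your import does):

require 'csv'

ARGV.each do |path|
  CSV.foreach(path, headers: true) do |row|
    # process the row / write it to the database here
  end
end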
There are a few ways of going about it. I would personally recommend SmarterCSV, which makes it much faster and easier to process CSVs as arrays of hashes. You should definitely split up the work if possible, perhaps by building a queue of files to process and working through it in batches, for example with Redis.
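To give a flavor of that, a sketch (not tested against your data; 'data.csv', the chunk size and RecordModel are made-up placeholders, and the batch insert assumes the activerecord-import gem):

require 'smarter_csv'

# Read the CSV in chunks of hashes and insert each chunk in one statement,
# instead of creating records one row at a time.
SmarterCSV.process('data.csv', chunk_size: 500) do |chunk|
  # chunk is an array of hashes keyed by the CSV headers
  RecordModel.import(chunk.map { |attrs| RecordModel.new(attrs) })
end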

Why is Ruby CSV file reading very slow?

I have a fairly large CSV file, with 4 million records and 375 fields, that needs to be processed.
I'm using the Ruby CSV library to read this file and it is very slow. I thought PHP CSV file processing was slow, but comparing the two reads, PHP is more than 100 times faster. I'm not sure if I'm doing something dumb or if this is just the reality of Ruby not being optimized for this type of batch processing. I set up simple test programs to get comparative times in both Ruby and PHP. All I do is read; no writing, no building of big arrays, and I break out of the CSV read loops after processing 50,000 records. Has anyone else experienced this performance issue?
I'm running locally on a Mac with 4 GB of memory, running OS X 10.6.8 and Ruby 1.8.7.
The Ruby process takes 497 seconds simply to read 50,000 records; the PHP process runs in 4 seconds, which is not a typo, it really is more than 100 times faster. FYI, I had code in the loops to print out data values to make sure that each of the processes was actually reading the files and bringing data back.
This is the Ruby Code:
require('time')
require('csv')

x = 0
t1 = Time.new
CSV.foreach(pathfile) do |row|
  x += 1
  if x > 50000 then break end
end
t2 = Time.new
puts " Time to read the file was #{t2-t1} seconds"
Here is the PHP code:
$t1 = time();
$fpiData = fopen($pathfile, 'r') or die("can not open input file ");
$seqno = 0;
while ($inrec = fgetcsv($fpiData, 0, ',', '"')) {
    if ($seqno > 50000) break;
    $seqno++;
}
fclose($fpiData) or die("can not close input data file");
$t2 = time();
$t3 = $t2 - $t1;
echo "Start time is $t1 - end time is $t2 - Time to Process was " . $t3 . "\n";
You'll likely get a massive speed boost by simply updating to a current version of Ruby. In version 1.9, FasterCSV was integrated as Ruby's standard CSV library.
Check out Chruby to manage your different Ruby versions.
Check out the smarter_csv Gem, which has special options for handling huge files by reading data in chunks.
It also returns the CSV data as hashes, which can make it easier to insert or update the data in a database.
I think that using CSV is a little bit overkill for this.
A long time ago I saw this question, and the reason for the slowness of the Ruby version is that it loads the entire CSV file into memory at once. I have seen some people overcome this issue by using the IO class; for example, take a look at this gist and its self.perform(url) method.
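A rough illustration of that idea (this is not the gist's actual code, just the general shape, and it only works when fields never contain quoted commas):

File.open(pathfile, 'r') do |io|
  io.each_line.with_index do |line, i|
    break if i >= 50_000
    fields = line.chomp.split(',')   # naive split; no CSV quoting/escaping
    # use fields...
  end
end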

How do I create this File Input and Output assignment in Ruby

I have an assignment that I am not sure how to approach, and I was wondering if anyone could help. This is it:
Create a program that allows the user to input how many hours they exercised for today. Then the program should output the total of how many hours they have exercised for all time. To allow the program to persist beyond the first run the total exercise time will need to be written and retrieved from a file.
My code is this so far:
myFileObject2 = File.open("exercise.txt")
myFileObject2.read
puts "This is an exercise log. It keeps track of the number hours of exercise."
hours = gets.to_f
myFileObject2.close
Write your code like:
File.open("exercise.txt", "r") do |fi|
file_content = fi.read
puts "This is an exercise log. It keeps track of the number hours of exercise."
hours = gets.chomp.to_f
end
Ruby's File.open takes a block. When that block exits File will automatically close the file. Don't use the non-block form unless you are absolutely positive you know why you should do it another way.
chomp the value you get from gets. This is because gets won't return until it sees a trailing END-OF-LINE, which is usually a "\n" on Mac OS and *nix, or "\r\n" on Windows. Failing to remove that with chomp is the cause of much weeping and gnashing of teeth in unaware developers.
The rest of the program is left for you to figure out.
The code will fail if "exercise.txt" doesn't already exist. You need to figure out how to deal with that.
Using read is bad form unless you are absolutely positive the file will always fit in memory, because the entire file is read at once. Once it is in memory, it will be one big string of data, so you'll have to figure out how to break it into an array in order to iterate over it. There are better ways to handle reading than read, so I'd study the IO class, plus read what you can find on Stack Overflow. Hint: don't slurp your files.
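As a hint for the missing-file point, one possible guard (just a sketch; the rest of the assignment is still yours to write):

# Treat a missing log file as zero hours exercised so far.
# The file only ever holds one number, so reading it whole is fine here.
total = File.exist?("exercise.txt") ? File.read("exercise.txt").to_f : 0.0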

Fastest Way to Parse a Large File in Ruby

I have a simple text file that is ~150 MB. My code will read each line, and if it matches certain regexes, it gets written to an output file.
But right now, it just takes a long time to iterate through all of the lines of the file (several minutes), doing it like this:
File.open(filename).each do |line|
  # do some stuff
end
I know that it is the looping through the lines of the file that is taking a while because even if I do nothing with the data in "#do some stuff", it still takes a long time.
I know that some Unix programs can parse large files like this almost instantly (like grep), so I am wondering why Ruby (MRI 1.9) takes so long to read the file, and whether there is some way to make it faster.
It's not really fair to compare to grep because that is a highly tuned utility that only scans the data, it doesn't store any of it. When you're reading that file using Ruby you end up allocating memory for each line, then releasing it during the garbage collection cycle. grep is a pretty lean and mean regexp processing machine.
You may find that you can achieve the speed you want by using an external program like grep called using system or through the pipe facility:
`grep ABC bigfile`.split(/\n/).each do |line|
  # ... (called on each matching line) ...
end
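If slurping grep's entire output into one string is a concern, a variation on the same idea (a sketch) is to stream it through IO.popen instead:

IO.popen(['grep', 'ABC', 'bigfile']) do |io|
  io.each_line do |line|
    # ... handle each matching line as it arrives ...
  end
end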
File.readlines(filename).each do |line|
  # do stuff with each line
end
This will read the whole file into one array of lines. It should be a lot faster, but it takes more memory.
You should read it into memory and then parse it. Of course, it depends on what you are looking for. Don't expect miracle performance from Ruby, especially compared to C/C++ programs that have been optimized over the past 30 years. ;-)

Increasing the Loading Speed of Large Files

There are two large text files (millions of lines) that my program uses. These files are parsed and loaded into hashes so that the data can be accessed quickly. The problem I face is that, currently, the parsing and loading is the slowest part of the program. Below is the code where this is done.
database = extractDatabase(@type).chomp("fasta") + "yml"
revDatabase = extractDatabase(@type + "-r").chomp("fasta.reverse") + "yml"
@proteins = Hash.new
@decoyProteins = Hash.new

File.open(database, "r").each_line do |line|
  parts = line.split(": ")
  @proteins[parts[0]] = parts[1]
end

File.open(revDatabase, "r").each_line do |line|
  parts = line.split(": ")
  @decoyProteins[parts[0]] = parts[1]
end
And the files look like the example below. It started off as a YAML file, but the format was modified to increase parsing speed.
MTMDK: P31946 Q14624 Q14624-2 B5BU24 B7ZKJ8 B7Z545 Q4VY19 B2RMS9 B7Z544 Q4VY20
MTMDKSELVQK: P31946 B5BU24 Q4VY19 Q4VY20
....
I've messed around with different ways of setting up the file and parsing them, and so far this is the fastest way, but it's still awfully slow.
Is there a way to improve the speed of this, or is there a whole other approach I can take?
List of things that don't work:
YAML.
Standard Ruby threads.
Forking off processes and then retrieving the hash through a pipe.
In my usage, reading all or part of the file into memory before parsing usually goes faster. If the database files are small enough, this could be as simple as:
buffer = File.readlines(database)
buffer.each do |line|
  ...
end
If they're too big to fit into memory, it gets more complicated: you have to set up block reads of the data followed by parsing, or use threading with separate read and parse threads.
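A sketch of that block-read approach, assuming the same "KEY: values" format shown above (the 1 MB chunk size is arbitrary); the fiddly part is a line that straddles two chunks, which is carried over as leftover:

def load_hash(path, chunk_size = 1024 * 1024)
  hash = {}
  leftover = ''
  File.open(path, 'rb') do |file|
    while (chunk = file.read(chunk_size))
      chunk = leftover + chunk
      lines = chunk.split("\n", -1)
      leftover = lines.pop              # possibly incomplete last line
      lines.each do |line|
        key, value = line.split(': ', 2)
        hash[key] = value if key
      end
    end
  end
  unless leftover.empty?
    key, value = leftover.split(': ', 2)
    hash[key] = value if key
  end
  hash
end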
Why not use the solution devised through decades of experience: a database, say SQLite3?
(To be different, although I'd first recommend looking at (Ruby) BDB and other "NoSQL" backend-engines, if they fit your need.)
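A minimal sketch of the SQLite idea using the sqlite3 gem (the table and column names are made up for illustration, and database is the same variable as in the question):

require 'sqlite3'

db = SQLite3::Database.new('proteins.sqlite3')
db.execute('CREATE TABLE IF NOT EXISTS proteins (peptide TEXT PRIMARY KEY, ids TEXT)')

# One-time load, wrapped in a transaction so the inserts are fast.
db.transaction do
  File.foreach(database) do |line|
    key, value = line.chomp.split(': ', 2)
    db.execute('INSERT OR REPLACE INTO proteins VALUES (?, ?)', [key, value])
  end
end

# Later, look up a single peptide without holding everything in RAM.
ids = db.get_first_value('SELECT ids FROM proteins WHERE peptide = ?', 'MTMDK')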
If fixed-size records with a deterministic index are used, then you can perform a lazy load of each item through a proxy object. This would be a suitable candidate for mmap. However, it will not speed up the total access time; it merely amortizes the loading over the life-cycle of the program (at least until first use, and if some data is never used you get the benefit of never loading it). Without fixed-size records or deterministic index values the problem is more complex and starts to look more like a traditional "index" store (e.g. a B-tree in an SQL back-end, or whatever BDB uses :-).
The general problems with threading here are:
IO will likely be your bottleneck with Ruby "green" threads
You still need all the data before use
You may be interested in the Widefinder Project, which in general is about "trying to get faster IO processing".
I don't know too much about Ruby, but I have had to deal with this problem before. I found the best way was to split the file up into chunks or separate files, then spawn threads so the chunks are read in at the same time. Once the partitioned files are in memory, combining the results should be fast. Here is some information on threads in Ruby:
http://rubylearning.com/satishtalim/ruby_threads.html
Hope that helps.

Resources