I need to run a rake script for a specific amount of time, for example 10 minutes, 1 hour, etc., and if the script hasn't finished by then, stop it anyway.
Why do I need this? Because after a few hours the memory is full!
Any suggestions?
Thanks
I think you should first consider what is making this script use up so much memory (one thing would be loading lots of records from the database and appending them to an array). But assuming you have already done everything you can, I'd do something like this:
LIVE_FOR = 1.hour # ActiveSupport duration helper

def run!
  finish_before = LIVE_FOR.from_now
  array = get_the_array # some big collection to operate on
  array.each do |object|
    break if Time.now >= finish_before # stop once the time budget is used up
    # process the object here
  end
end
But really, I'd first try to tackle why you have a memory leak.
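If this runs from a rake task, a minimal sketch of wiring it up could look like the following (the task name and the :environment prerequisite are assumptions, not part of the question):

# hypothetical Rakefile entry
task :timed_import => :environment do
  run! # returns on its own once LIVE_FOR has elapsed
end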
Problem
Howdy guys. I want to find the run time of a block of code in Ruby, but I am not entirely sure how to do it. I want to run some code and then output how long it took, because I have a very large program and the run time changes a lot. I want to make sure it always has a consistent run time (I could do that by sleeping for a fraction of a second), but that isn't my problem: I want to find out how long the run time actually is, so the program can know whether it needs to slow things down or speed things up.
My Thoughts
So, I have an idea of how it could work. I have never used Time in Ruby, but I could set a variable equal to the time (in milliseconds) at the start, set another variable the same way at the end of the code block, and then subtract them. However, I have (1) never used Time and (2) I don't actually know if that is the best way.
Thanks in advance!
Ruby has the Benchmark module for timing how long things take. I've only used it in development to see whether a method is taking too long to run, and I'm not sure whether it's "recommended" for production code or for keeping things above a minimum runtime (as it sounds like you might be doing), but take a look and see how it feels for your use case.
It also sounds like you might be interested in the Timeout module (for making sure things don't take longer than a set amount of time).
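A rough sketch of using Timeout for that (the ten-minute budget here is just an example, not something from the question):

require 'timeout'

begin
  Timeout.timeout(10 * 60) do # raises Timeout::Error after 10 minutes
    # long-running work goes here
  end
rescue Timeout::Error
  puts 'Stopped: the time budget was used up'
end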
If you really have a use case for making sure something takes a minimum amount of time, the only thing that comes to mind is timing the code (using a Benchmark method, or just Time, or another solution) and then sleeping for the difference.
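A minimal sketch of that idea (MIN_RUNTIME and the 0.5-second value are made up for illustration):

MIN_RUNTIME = 0.5 # seconds

start = Time.now
# ... the code you want to keep at a consistent pace ...
elapsed = Time.now - start
sleep(MIN_RUNTIME - elapsed) if elapsed < MIN_RUNTIME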
It is simple. Look at your watch (Time.now) and remember the time, run the code, look at your watch again, subtract.
t0 = Time.now
# your block of code
puts Time.now - t0
You want to use the Time object (Time docs: http://ruby-doc.org/core-1.9.3/Time.html).
For example,
start = Time.now
# code to time
finish = Time.now
diff = finish - start
diff would be in seconds, as a floating point number.
EDIT: end is a reserved word, which is why the variable is called finish.
Or you can use the Benchmark module:

require 'benchmark'

def foo
  time = Benchmark.measure {
    # code to test goes here
  }
  puts time.real # or save it to logs
end
Sample output:
2.2.3 :001 > foo
5.230000 0.020000 5.250000 ( 5.274806)
The values are user CPU time, system CPU time, the sum of the two (total), and real elapsed time.
http://ruby-doc.org/stdlib-2.0.0/libdoc/benchmark/rdoc/Benchmark.html#method-c-bm
Source: Ruby docs.
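Since the documentation link above points at Benchmark.bm, here is a small example of comparing two labeled snippets side by side (the snippets themselves are arbitrary placeholders):

require 'benchmark'

Benchmark.bm(15) do |x|
  x.report('integer parse:') { 100_000.times { Integer('42') } }
  x.report('string concat:') { 100_000.times { 'a' + 'b' } }
end

Each report line prints the same user/system/total/real columns shown in the sample output above.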
I'll try to expand on the title of my question. I'm working on a Ruby project in which I have to process a large amount of data (around 120,000 records) stored in CSV files. I have to read the data, process it, and put it in a database. Right now this takes a couple of days, and I need to make it much faster. The problem is that sometimes I get anomalies during processing and have to repeat the whole import, so I decided that improving performance is more important than hunting for the bug with a small amount of data. For now I'm sticking with CSV files. I decided to benchmark the processing script to find the bottlenecks and improve loading data from CSV. I see the following steps:
Benchmark and fix the most problematic bottlenecks
Maybe split CSV loading from processing. For example, create a separate staging table and load the raw data there; in a second step, read that data, process it, and put it into the right table.
Introduce threads to load data from CSV
For now I use the standard Ruby CSV library. Do you recommend a better gem?
If any of you are familiar with a similar problem, I'd be happy to hear your opinion.
Edit:
Database: PostgreSQL
System: Linux
I haven't had the opportunity to test it myself, but I recently came across this article, which seems to do the job.
https://infinum.co/the-capsized-eight/articles/how-to-efficiently-process-large-excel-files-using-ruby
You'll have to adapt it to use CSV instead of XLSX.
For future reference, in case the site ever goes down, here is the code.
It works by writing BATCH_IMPORT_SIZE records to the database at a time, which should give a huge improvement.
class ExcelDataParser
  BATCH_IMPORT_SIZE = 1000

  def initialize(file_path)
    @file_path = file_path
    @records = []
    @counter = 1
  end

  def call
    rows.each do |row|
      increment_counter
      records << build_new_record(row)
      import_records if reached_batch_import_size? || reached_end_of_file?
    end
  end

  private

  attr_reader :file_path, :records
  attr_accessor :counter

  def book
    @book ||= Creek::Book.new(file_path)
  end

  # in this example, we assume that the
  # content is in the first Excel sheet
  def rows
    @rows ||= book.sheets.first.rows
  end

  def increment_counter
    self.counter += 1
  end

  def row_count
    @row_count ||= rows.count
  end

  def build_new_record(row)
    # only build a new record without saving it
    RecordModel.new(...)
  end

  def import_records
    # save multiple records using the activerecord-import gem
    RecordModel.import(records)
    # clear the records array
    records.clear
  end

  def reached_batch_import_size?
    (counter % BATCH_IMPORT_SIZE).zero?
  end

  def reached_end_of_file?
    counter == row_count
  end
end
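For reference, a hypothetical way to invoke the class (the file name is made up, and RecordModel, Creek, and activerecord-import are assumed to be set up as in the article):

ExcelDataParser.new('data/records.xlsx').call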
Make sure your Ruby script can process multiple files given as parameters, i.e. that you can run it like this:
script.rb abc.csv def.csv xyz.csv
Then parallelise it using GNU Parallel like this to keep all your CPU cores busy:
find . -name \*.csv -print0 | parallel -0 -X script.rb
The -X passes as many CSV files as possible to your Ruby job without exceeding the maximum command-line length. You can add -j 8 after parallel if you want GNU Parallel to run, say, 8 jobs at a time, and you can use --eta to get the estimated finish time:
find . -name \*.csv -print0 | parallel -0 -X -j 8 --eta script.rb
By default, GNU Parallel will run as many jobs in parallel as you have CPU cores.
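For reference, a minimal sketch of what such a script.rb could look like (the processing body is left as a placeholder):

#!/usr/bin/env ruby
require 'csv'

# process every CSV path passed on the command line
ARGV.each do |path|
  CSV.foreach(path, headers: true) do |row|
    # process / import the row here
  end
end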
There are a few ways of going about it. I personally would recommend SmarterCSV, which makes it much faster and easier to process CSVs as arrays of hashes. You should definitely split up the work if possible, perhaps by making a queue of files to process and doing it in batches with the help of Redis.
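A rough sketch of chunked processing with SmarterCSV (the file name and chunk size are placeholders, and the import call assumes the activerecord-import gem mentioned in the other answer):

require 'smarter_csv'

SmarterCSV.process('data/records.csv', chunk_size: 1_000) do |chunk|
  # chunk is an array of row hashes
  RecordModel.import(chunk.map { |attrs| RecordModel.new(attrs) })
end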
I am writing a program that loads data from four XML files into four different data structures. It has methods like this:
def loadFirst(year)
  File.open("games_#{year}.xml", 'r') do |f|
    doc = REXML::Document.new f
    ...
  end
end

def loadSecond(year)
  File.open("teams_#{year}.xml", 'r') do |f|
    doc = REXML::Document.new f
    ...
  end
end

etc...
I originally just used one thread and loaded one file after another:
def loadData(year)
  time = Time.now
  loadFirst(year)
  loadSecond(year)
  loadThird(year)
  loadFourth(year)
  puts Time.now - time
end
Then I realized that I should be using multiple threads. My expectation was that loading from each file on a separate thread would be very close to four times as fast as doing it all sequentially (I have a MacBook Pro with an i7 processor):
def loadData(year)
  time = Time.now
  t1 = Thread.start { loadFirst(year) }
  t2 = Thread.start { loadSecond(year) }
  t3 = Thread.start { loadThird(year) }
  loadFourth(year)
  t1.join
  t2.join
  t3.join
  puts Time.now - time
end
What I found was that the version using multiple threads is actually slower than the other. How can this possibly be? The difference is around 20 seconds with each taking around 2 to 3 minutes.
There are no shared resources between the threads. Each opens a different data file and loads data into a different data structure than the others.
I think (but I'm not sure) that the problem is that you are reading, with multiple threads, content stored on the same disk, so your threads can't really run simultaneously because they are all waiting for I/O (the disk).
Some days ago I had to do a similar thing (but fetching data from the network), and the difference between the sequential and the threaded version was huge.
A possible solution could be to load the whole file content up front instead of loading it the way you do in your code, where the content is read bit by bit during parsing. If you load all the content first and then process it, you should be able to perform much better, because the threads would not be waiting on I/O.
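A minimal sketch of that idea for one of the loaders (just illustrative, not tested against the original code):

require 'rexml/document'

def loadFirst(year)
  xml = File.read("games_#{year}.xml") # one sequential read of the whole file
  doc = REXML::Document.new(xml)       # parse from the in-memory string
  # ... walk the document as before
end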
It's impossible to give a conclusive answer as to why your parallel program is slower than the sequential one without a lot more information, but one possibility is:
With the sequential program, your disk seeks to the first file, reads it all out, seeks to the 2nd file, reads it all out, and so on.
With the parallel program, the disk head keeps moving back and forth trying to service I/O requests from all 4 threads.
I don't know if there's any way to measure disk seek time on your system: if so, you could confirm whether this hypothesis is true.
I have an implementation of a min heap in Ruby that I wanted to test against more professional code, but I cannot get Kanwei's MinHeap to work properly.
This:
require 'algorithms' # Kanwei's gem, provides Containers::MinHeap

mh = Containers::MinHeap.new # Min_Binary_Heap.new
for i in 0..99999
  mh.push(rand(9572943))
end

t = Time.now
for i in 0..99999
  mh.pop
end
t = Time.now - t
print "#{t}s"
The version I have performs the same popping operations on 100,000 values in ~2.2s, which I thought was extremely slow, but this won't even finish running. Is that expected or am I doing something wrong?
I don't think you are doing anything wrong.
Looking at the source (https://github.com/kanwei/algorithms/blob/master/lib/containers/heap.rb), I'd put a puts statement after you finish setting up the heap. Inserting the elements looks like a very memory-intensive operation (potentially re-sorting each time), so separating the setup from the popping might help you work out where the time goes.
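A small sketch of that kind of instrumentation, timing the push phase and the pop phase separately (the element count and random range are copied from the question):

require 'benchmark'
require 'algorithms'

mh = Containers::MinHeap.new
push_time = Benchmark.realtime { 100_000.times { mh.push(rand(9572943)) } }
pop_time  = Benchmark.realtime { 100_000.times { mh.pop } }
puts "push: #{push_time}s, pop: #{pop_time}s"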
I'm also not sure about the decision to create a node object for each actual node. Since they won't get cleaned up, there are going to be around 100,000 of them in memory by the time you are done.
Not sure how much help that is; maybe see how the source differs from your attempt?
I'm working with 6 CSV files, each of which contains the attributes of one kind of object. I can read them one at a time, but the idea of handing each one to a thread and processing them in parallel is very appealing.
I've created a database object (no relational DBs or ORMs allowed) that holds an array for each kind of object. I've tried the following to make each CSV open and initialize concurrently, but have seen no impact on speed.
threads = []
CLASS_FILES.each do |klass, filename|
  threads << Thread.new do
    file_to_objects(klass, filename)
  end
end
threads.each { |thread| thread.join }
update
end

def self.load(filename)
  CSV.open("data/#{filename}", CSV_OPTIONS)
end

def self.file_to_objects(klass, filename)
  file = load(filename)
  method_name = filename.sub("s.csv", "")
  file.each do |line|
    instance = klass.new(line.to_hash)
    Database.instance.send("#{method_name}") << instance
  end
end
How can I speed things up in Ruby (MRI 1.9.3)? Is this a good case for Rubinius?
Even though Ruby 1.9.3 uses native threads to implement concurrency, it has a global interpreter lock (GIL) which ensures that only one thread executes Ruby code at a time.
Therefore, nothing really runs in parallel in CRuby (MRI). JRuby imposes no such lock on its threads, so try using it to run your code, if possible.
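A quick way to see the effect of the GIL on CPU-bound work like this (the workload below is arbitrary and the absolute numbers will vary by machine; on MRI the threaded version is typically no faster):

require 'benchmark'

work = -> { 2_000_000.times { Math.sqrt(123.456) } }

sequential = Benchmark.realtime { 4.times { work.call } }
threaded   = Benchmark.realtime do
  4.times.map { Thread.new { work.call } }.each(&:join)
end

puts "sequential: #{sequential.round(2)}s, threaded: #{threaded.round(2)}s"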
This answer by Jörg W Mittag takes a more in-depth look at the threading models of the various Ruby implementations. It isn't clear to me whether Rubinius is fit for the job, but I'd give it a try.