Kanwei minheap slow ruby - ruby

I had an implementation of a min heap in ruby that I wanted to test against more professional code but I cannot get Kanwei's MinHeap to work properly.
This:
mh = Containers::MinHeap.new # Min_Binary_Heap.new
for i in 0..99999
mh.push(rand(9572943))
end
t = Time.now
for i in 0..99999
mh.pop
end
t = Time.now - t
print "#{t}s"
The version I have performs the same popping operations on 100,000 values in ~2.2s, which I thought was extremely slow, but this won't even finish running. Is that expected or am I doing something wrong?

I don't think you are doing something wrong.
Looking at the source (https://github.com/kanwei/algorithms/blob/master/lib/containers/heap.rb), I'd add a puts statement at the point where you finish setting up the heap. Putting the elements in looks like a very memory-intensive operation (potentially re-sorting each time), so that may help you see where the time is going.
I'm also not sure about him creating a node class for each actual node. Since they won't get cleaned up, there's going to be around 100,000 objects in memory by the time you are done.
Not sure how much help that is, maybe see how the source differs from your attempt?
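For comparison, here is a minimal sketch of an array-backed binary min-heap, where push and pop are each O(log n) and no per-node objects are allocated. The class name ArrayMinHeap is made up for illustration; this is not Kanwei's implementation, just a baseline to time against.

```ruby
# Minimal array-backed binary min-heap (illustrative sketch only).
class ArrayMinHeap
  def initialize
    @items = []
  end

  def size
    @items.size
  end

  def empty?
    @items.empty?
  end

  # Append at the end, then sift the new value up towards the root.
  def push(value)
    @items << value
    i = @items.size - 1
    while i > 0
      parent = (i - 1) / 2
      break if @items[parent] <= @items[i]
      @items[parent], @items[i] = @items[i], @items[parent]
      i = parent
    end
    self
  end

  # Remove and return the smallest value, or nil when the heap is empty.
  def pop
    return nil if @items.empty?
    min = @items.first
    last = @items.pop
    unless @items.empty?
      @items[0] = last
      sift_down(0)
    end
    min
  end

  private

  # Swap the value at index i with its smaller child until order is restored.
  def sift_down(i)
    n = @items.size
    loop do
      left = 2 * i + 1
      right = left + 1
      smallest = i
      smallest = left if left < n && @items[left] < @items[smallest]
      smallest = right if right < n && @items[right] < @items[smallest]
      break if smallest == i
      @items[i], @items[smallest] = @items[smallest], @items[i]
      i = smallest
    end
  end
end
```

Swapping this in for Containers::MinHeap in the timing loop from the question should show whether the slowdown comes from the library or from the test itself.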


Looking for a more efficient way to pull data from multiple datasets in SAS

I'm trying to find a more efficient and speedier way (if possible) to pull subsets of observations that meet certain criteria from multiple hospital claims datasets in SAS. A simplified but common type of data pull would look like this:
data out.qualifying_patients;
set in.state1_2017
in.state1_2018
in.state1_2019
in.state1_2020
in.state2_2017
in.state2_2018
in.state2_2019
in.state2_2020;
array prcode{*} I10_PR1-I10_PR25;
do i=1 to 25;
if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort=1;
end;
if cohort=1 then output;
run;
Now imagine that instead of 2 states and 4 years we have 18 states and 9 years -- each about 1GB in size. The code above works fine but it takes FOREVER to run on our non-optimized server setup. So I'm looking for alternate methods to perform the same task but hopefully at a faster clip.
I've tried including (KEEP=) or (DROP=) statements for each dataset included the SET statement to limit the variables being scanned, but this really didn't have much of an impact on speed -- and, for non-coding-related reasons, we pretty much need to pull all the variables.
I've also experimented a bit with hash tables but it's too much to store in memory so that didn't seem to solve the issue. This also isn't a MERGE issue which seems to be what hash tables excel at.
Any thoughts on other approaches that might help? Every data pull we do contains customized criteria for a given project, but we do these pulls a lot and it seems really inefficient to constantly be processing thru the same datasets over and over but not benefitting from that. Thanks for any help!
I happened to have a 1GB dataset on my computer. I tried several times, and it takes SAS no more than 25 seconds to set the dataset 8 times. I think the SET statement is too simple and basic for its efficiency to be improved.
I think the issue may be located in the DO loop. Your program runs the loop 25 times for each record and may assign cohort more than once, which is not necessary. You can change it like this:
do i=1 to 25 until(cohort=1);
if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort=1;
end;
This can save a lot of loop iterations.
First, parallelization will help immensely here. Instead of running one job that reads one dataset after the next, run one job per state, or one job per year, or whatever makes sense for your dataset size and CPU count. (You don't want more than 1 job per CPU.) If your server has 32 cores, then you can easily run all the jobs you need here - 1 per state, say - and then after that's done, combine the results together.
Look up SAS MP Connect for one way to do multiprocessing, which basically uses rsubmits to submit code to your own machine. You can also do this by using xcmd to literally launch SAS sessions - add a parameter to the SAS program of state, then run 18 of them, have them output their results to a known location with state name or number, and then have your program collect them.
Second, you can optimize the DO loop more - in addition to the suggestions above, you may be able to optimize using pointers. SAS stores character array variables in memory in adjacent spots (assuming they all come from the same place) - see "From Obscurity to Utility: ADDR, PEEK, POKE as DATA Step Programming Tools" from Paul Dorfman for more details here. On page 10, he shows the method I describe here; you PEEKC to get the concatenated values and then use INDEXW to find the thing you want.
data want;
set have;
array prcode{*} $8 I10_PR1-I10_PR25;
found = (^^ indexw (peekc (addr(prcode[1]), 200 ), '0DTJ0ZZ')) or
(^^ indexw (peekc (addr(prcode[1]), 200 ), '0DTJ4ZZ'))
;
run;
Something like that should work. It avoids the loop.
You also could, if you want to keep the loop, exit the loop once you run into an empty procedure code. Usually these things don't go all 25, at least in my experience - they're left-filled, so I10_PR1 is always filled, and then some of them - say, 5 or 10 of them - are filled, then I10_PR11 and on are empty; and if you hit an empty one, you're all done for that round. So not just leaving when you hit what you are looking for, but also leaving when you hit an empty, saves you a lot of processing time.
You probably should consider a hardware upgrade or find someone who can tune your server. This paper suggests tips to improve the processing of large datasets.
Your code is pretty straightforward. The only suggestion is to exit the loop as soon as the criterion is met to avoid wasting resources.
do i=1 to 25;
if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then do;
output; * cohort criteria met so output the row;
leave; * exit the loop immediately;
end;
end;

Find the Run Time of Select Ruby Code

Problem
Howdy guys, so I want to find the run time of a block of code in Ruby, but I am not entirely sure as to how I could do it. I want to run some code, and then output how long it took to run that code because I have a super huge program and the run time changes a lot. I want to make sure it always has a consistent run time (I could do it by sleeping it for a fraction of a second) but that isn't my problem. I want to find out how long the run time actually is so the program can know if it needs to slow things down or speed things up.
My Thoughts
So, I have an idea as to how it could work. I have never used Time in ruby but I have an idea as to how I could use that. I could have a variable equal to the time (in milliseconds) and then another variable that I make at the end of the code block that does it again, and then I just subtract them, but I have (1) never used Time and (2) I don't actually know if that is the best way.
Thanks in advance!
Ruby has the Benchmark module for timing how long things take. I've never used this outside of seeing if a method is taking too long to run, etc. in development, not sure if this is 'recommended' for production code or for keeping things above a minimum runtime (as it sounds like you might be doing), but take a look and see how it feels for your use case.
It also sounds like you might be interested in the Timeout module as well (for making sure things don't take longer than a set amount of time).
If you really have a use case for making sure something takes a minimum amount of time, timing the code (either using a Benchmark method or just Time or another solution) and then sleep the difference is the only thing that comes to mind.
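A rough sketch of the "time it, then sleep the difference" idea, using the monotonic clock so that system clock adjustments don't skew the measurement. The method name with_minimum_duration is made up for illustration.

```ruby
# Runs the block, then sleeps for whatever is left of min_seconds.
# Returns the time the block itself took, in seconds.
def with_minimum_duration(min_seconds)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  remaining = min_seconds - elapsed
  sleep(remaining) if remaining > 0
  elapsed
end
```

Something like with_minimum_duration(0.5) { do_work } would then take at least half a second per call, where do_work stands in for your own code. If the block already takes longer than the minimum, nothing is added.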
It is simple. Look at your watch (Time.now) and remember the time, run the code, look at your watch again, subtract.
t0 = Time.now
# your block of code
puts Time.now - t0
http://ruby-doc.org/core-1.9.3/Time.html
You want to use the Time object. (Time Docs)
For example,
start = Time.now
# code to time
finish = Time.now
diff = finish - start
diff would be in seconds, as a floating point number.
EDIT: end is reserved.
or you can use
require 'benchmark'
def foo
time = Benchmark.measure {
# code to test
}
puts time # or time.real for just the elapsed seconds; or save it to logs
end
Sample output:
2.2.3 :001 > foo
5.230000 0.020000 5.250000 ( 5.274806)
Values are user CPU time, system CPU time, total CPU time, and real elapsed time (in parentheses).
http://ruby-doc.org/stdlib-2.0.0/libdoc/benchmark/rdoc/Benchmark.html#method-c-bm
Source: Ruby docs.

How can I increase the performance of watir-webdriver automated scripts

The main problem I'm having is pulling data from tables, but any other general tips would be welcome too. The tables I'm dealing with have roughly 25 columns and varying numbers of rows (anywhere from 5-50).
Currently I am grabbing the table and converting it to an array:
require "watir-webdriver"
b = Watir::Browser.new :chrome
b.goto "http://someurl"
# The following operation takes way too long
table = b.table(:index, 1).to_a
# The rest is fast enough
table.each do |row|
# Code for pulling data from about 15 of the columns goes here
# ...
end
b.close
The operation table = b.table(:index, 5).to_a takes over a minute when the table has 20 rows. It seems like it should be very fast to put the cells of a 20 X 25 table into an array. I need to do this for over 80 tables, so it ends up taking 1-2 hours to run. Why is it taking so long and how can I improve the speed?
I have tried iterating over the table rows without first converting to an array as well, but there was no improvement in performance:
b.table(:index, 1).rows.each do |row|
# ...
Same results using Windows 7 and Ubuntu. I've also tried Firefox instead of Chrome without a noticeable difference.
A quick workaround would be to use Nokogiri if you're just reading data from a big page:
require 'nokogiri'
doc = Nokogiri::HTML.parse(b.table(:index, 1).html)
I'd love to see more detail though. If you can provide a code + HTML example that demonstrates the issue, please file it in the issue tracker.
The #1 thing you can do to improve the performance of a script that uses watir is to reduce the number of remote calls into the browser. Each time you locate or operate on a DOM element, that's a call into the browser and can take 5ms or more.
In your case, you can reduce the number of remote calls by doing the work on the browser side via execute_script() and checking the result on the ruby side.
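As a sketch of that idea, the whole table can be read in a single round trip by letting JavaScript collect the cell text and return it as nested arrays. This assumes the browser object responds to execute_script(script, *args) the way Watir::Browser does; the constant TABLE_TEXT_JS and the method table_text are made-up names, and the table is picked by its zero-based position among the page's table elements, which may not line up with the :index you pass to b.table.

```ruby
# JavaScript that returns the text of every cell in the chosen table
# as an array of rows, each row being an array of strings.
TABLE_TEXT_JS = <<~JS
  var table = document.getElementsByTagName('table')[arguments[0]];
  if (!table) { return []; }
  return Array.prototype.map.call(table.rows, function (row) {
    return Array.prototype.map.call(row.cells, function (cell) {
      return cell.textContent.trim();
    });
  });
JS

# One remote call for the whole table instead of one per row and cell.
def table_text(browser, dom_index)
  browser.execute_script(TABLE_TEXT_JS, dom_index)
end
```

The result can then be iterated on the Ruby side just like the array from to_a, without any further calls into the browser.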
When attempting to improve the speed of your code it's vital to have some means of testing execution times (e.g. ruby benchmark). You might also like to look at ruby-prof to get a detailed breakdown of the time spent in each method.
I would start by establishing whether it is the to_a method, rather than locating the table, that causes the delay on that line of code. Watir's internals (or nokogiri as per jarib's answer) may be quicker.

Run rake script for specific time range

I need to run a rake script for a specific amount of time, for example 10 minutes or 1 hour, and if the script has not finished by then, stop it anyway.
Why do I need this? Because after some hours the memory is full!
Any suggestions?
Thanks
I think you should first consider what is making this script use up so much memory.
(one thing would be loading up lots of records from the database, and appending them to an array)
But assuming you have already done everything you can,
I'd do something like this.
LIVE_FOR = 1.hour
def run!
finish_before = LIVE_FOR.from_now
array = get_the_array # some big collection to operate on
array.each do |object|
break if Time.now >= finish_before # stop once the time limit is reached
...
end
end
But really, I'd first try to tackle why you have a memory leak.
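Note that 1.hour and from_now come from ActiveSupport, so the snippet above needs Rails (or ActiveSupport) loaded. A plain-Ruby sketch of the same deadline check might look like this; the method name process_with_budget is made up for illustration.

```ruby
# Processes items until the time budget runs out.
# Returns how many items were handled before the deadline.
def process_with_budget(items, budget_seconds)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + budget_seconds
  handled = 0
  items.each do |item|
    # The check happens between items, so one slow item can overrun the budget.
    break if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    yield item
    handled += 1
  end
  handled
end
```

Because the deadline is only checked between items, this stops cleanly rather than interrupting work halfway through, at the cost of possibly running a little past the limit.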

Increasing the Loading Speed of Large Files

There are two large text files (Millions of lines) that my program uses. These files are parsed and loaded into hashes so that the data can be accessed quickly. The problem I face is that, currently, the parsing and loading is the slowest part of the program. Below is the code where this is done.
database = extractDatabase(@type).chomp("fasta") + "yml"
revDatabase = extractDatabase(@type + "-r").chomp("fasta.reverse") + "yml"
@proteins = Hash.new
@decoyProteins = Hash.new
File.open(database, "r").each_line do |line|
parts = line.split(": ")
@proteins[parts[0]] = parts[1]
end
File.open(revDatabase, "r").each_line do |line|
parts = line.split(": ")
@decoyProteins[parts[0]] = parts[1]
end
end
And the files look like the example below. It started off as a YAML file, but the format was modified to increase parsing speed.
MTMDK: P31946 Q14624 Q14624-2 B5BU24 B7ZKJ8 B7Z545 Q4VY19 B2RMS9 B7Z544 Q4VY20
MTMDKSELVQK: P31946 B5BU24 Q4VY19 Q4VY20
....
I've messed around with different ways of setting up the file and parsing them, and so far this is the fastest way, but it's still awfully slow.
Is there a way to improve the speed of this, or is there a whole other approach I can take?
List of things that don't work:
YAML.
Standard Ruby threads.
Forking off processes and then retrieving the hash through a pipe.
In my usage, reading all or part of the file into memory before parsing usually goes faster. If the database sizes are small enough this could be as simple as
buffer = File.readlines(database)
buffer.each do |line|
...
end
If they're too big to fit into memory, it gets more complicated: you have to set up block reads of data followed by a parse, or go threaded with separate read and parse threads.
Why not use the solution devised through decades of experience: a database, say SQLite3?
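For the parsing step itself, here is a sketch that works with either a buffered array of lines or a streamed file. The method names parse_index and load_index are made up for illustration, and whether this is actually faster on your files is something to measure rather than assume. Unlike the code in the question, it strips the trailing newline from each value and splits only on the first ": ".

```ruby
# Builds a hash from lines shaped like "KEY: value value ...".
# `lines` can be any enumerable of strings, e.g. an Array or File.foreach(path).
def parse_index(lines)
  index = {}
  lines.each do |line|
    # Limit of 2 keeps everything after the first ": " as a single value.
    key, value = line.chomp.split(": ", 2)
    next if value.nil? # skip lines without a "KEY: " prefix
    index[key] = value
  end
  index
end

# Streams the file line by line and closes it when done.
def load_index(path)
  parse_index(File.foreach(path))
end
```

With the buffer approach above, parse_index(File.readlines(database)) would give the same hash.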
(To be different, although I'd first recommend looking at (Ruby) BDB and other "NoSQL" backend-engines, if they fit your need.)
If fixed-sized records with a deterministic index are used then you can perform a lazy-load of each item through a proxy object. This would be a suitable candidate for a mmap. However, this will not speed up the total access time, but will merely amortize the loading throughout the life-cycle of the program (at least until first use and if some data is never used then you get the benefit of never loading it). Without fixed-sized records or deterministic index values this problem is more complex and starts to look more like a traditional "index" store (eg. a B-tree in an SQL back-end or whatever BDB uses :-).
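A very rough sketch of the lazy-load idea for fixed-size records, using plain seek and read rather than an actual mmap. The class name LazyRecords is made up for illustration, and it assumes record number i starts at byte offset i * record_size in a seekable IO.

```ruby
# Lazily reads fixed-size records from a seekable IO, caching each on first use.
class LazyRecords
  def initialize(io, record_size)
    @io = io
    @record_size = record_size
    @cache = {}
  end

  # Returns record number i (zero-based), reading it from the IO only once.
  def [](i)
    return @cache[i] if @cache.key?(i)
    @io.seek(i * @record_size)
    @cache[i] = @io.read(@record_size)
  end
end
```

As noted above, this does not reduce the total work if every record ends up being used; it only spreads the loading cost over the life of the program.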
The general problems with threading here are:
The IO will likely be your bottleneck around Ruby "green" threads
You still need all the data before use
You may be interested in the Widefinder Project, just in general "trying to get faster IO processing".
I don't know too much about Ruby but I have had to deal with the problem before. I found the best way was to split the file up into chunks or separate files then spawn threads to read each chunk in at a single time. Once the partitioned files are in memory combining the results should be fast. Here is some information on Threads in Ruby:
http://rubylearning.com/satishtalim/ruby_threads.html
Hope that helps.
