Ruby Array Operations: Increasing Efficiency

In preparation for manipulation of a large chunk of data, I perform some cleaning and pruning operations prior to processing it. The code I have functions fine, but I'm worried about how these operations will scale up when I perform this on millions of points. The code I have feels inefficient but I don't know how to simplify my steps.
In my process, I parse a CSV file, check for garbage data (i.e. non-numerical values), typecast the remaining data to floats, and then sort it. I'm hoping for some guidance on how to improve this if possible.
require 'green_shoes'
require 'csv'

class String
  def valid_float?
    true if Float(self) rescue false
  end
end

puts "Parsing..."
temp_file = ask_open_file("")
temp_arr = CSV.read(temp_file)
temp_arr.each do |temp_row|
  temp_row.map!{ |x| !x.valid_float? ? 0 : x }
  temp_row.map!{ |x| x.to_f }
  temp_row.sort!
end

My guess is that you'd want to return the file contents when you are done, right? If so, you'll want to use map on temp_arr, instead of each.
You can also save an iteration by combining the first two map! calls:
temp_arr.map! do |temp_row|
  temp_row.map!{ |x| x.valid_float? ? x.to_f : 0.0 }
  temp_row.sort!
end
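If the data really grows to millions of rows, it may also be worth streaming the file instead of loading it all at once with CSV.read. A minimal sketch using the standard library's CSV.foreach, assuming Ruby >= 2.6 for Float's exception: false option:

require 'csv'

cleaned = []
CSV.foreach(temp_file) do |row|
  # One pass per cell: Float(x, exception: false) returns nil for
  # non-numeric values (including nil), which we replace with 0.0.
  cleaned << row.map { |x| Float(x, exception: false) || 0.0 }.sort
end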

Related

Why can't I get Array#slice to work the way I want?

I have a big Array of AR model instances. Let's say there are 20K entries in the array. I want to move through that array a chunk of 1,000 at a time.
slice_size = 1000
start = 0

myarray.slice(start, slice_size) do |slice|
  slice.each do |item|
    item.dostuff
  end
  start += slice_size
end
I can replace that whole inner block with just:
puts "hey"
and not see a thing in the console. I have tried this 9 ways from Sunday. And I've done it successfully before, just can't remember where. And I have RTFM. Can anyone help?
The problem is that slice does not take a block; you are passing it one anyway, and the block is simply ignored. If you do
myarray.slice(start, slice_size).each do |slice|
  ...
end
then it should work.
But to do it that way is not Ruby-ish. A better way is
myarray.each_slice(slice_size) do |slice|
  ...
end
If the array can be destroyed, you could do it like this:
((myarray.size + slice_size - 1) / slice_size).times.map { myarray.shift(slice_size) }
If not:
((myarray.size + slice_size - 1) / slice_size).times.map { |i|
  myarray.slice(i * slice_size, slice_size)
}
You can use:
Enumerable#each_slice(n), which takes n items at a time;
Array#in_groups_of(n) (if this is Rails), which works like each_slice but pads the last group to keep the group size constant (see the sketch at the end of this answer).
But I recommend using ActiveRecord's built-in Model.find_each which will batch queries in the DB layer for better performance. It defaults to 1000, but you can specify the batch size. See http://guides.rubyonrails.org/active_record_querying.html#retrieving-multiple-objects-in-batches for more detail.
Example from the guide:
User.find_each(batch_size: 5000) do |user|
  NewsLetter.weekly_deliver(user)
end
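To make the padding difference concrete, here's a quick sketch (in_groups_of comes from ActiveSupport, so it needs Rails or the activesupport gem):

require 'active_support/core_ext/array/grouping' # for in_groups_of

[1, 2, 3, 4, 5].each_slice(2).to_a # => [[1, 2], [3, 4], [5]]
[1, 2, 3, 4, 5].in_groups_of(2)    # => [[1, 2], [3, 4], [5, nil]] (padded with nil)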

How to use less memory generating Array permutation?

So I need to get all possible permutations of a string.
What I have now is this:
def uniq_permutations string
  string.split(//).permutation.map(&:join).uniq
end
Ok, now here's my problem: this method works fine for small strings, but I want to be able to use it with strings of size 15 or maybe even 20. With this method it uses a lot of memory (>1 GB), and my question is: what could I change so it doesn't use that much memory?
Is there a better way to generate permutations? Should I persist them to the filesystem and retrieve them when I need them (I hope not, because that might make my method slow)?
What can I do?
Update:
I actually don't need to save the result anywhere I just need to lookup for each in a table to see if it exists.
Just to reiterate what Sawa said: do you understand the scale? The number of permutations of any n elements is n!. It's about the most aggressive mathematical progression you can get. The results for n from 1 to 20 are:
[1, 2, 6, 24, 120, 720, 5040, 40320, 362880, 3628800, 39916800, 479001600,
6227020800, 87178291200, 1307674368000, 20922789888000, 355687428096000,
6402373705728000, 121645100408832000, 2432902008176640000]
The last number is approximately 2.4 quintillion, which is 2.4 billion billion.
Even at a single byte per permutation, that is about 2,265,820,000 gigabytes.
You can save the results to disk all day long - unless you own all the Google datacenters in the world you're going to be pretty much out of luck here :)
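For reference, the table above is just the first twenty factorials:

(1..20).map { |n| (1..n).reduce(:*) }
# => [1, 2, 6, 24, ..., 2432902008176640000]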
Your call to map(&:join) is what is creating the array in memory, as map in effect turns an Enumerator into an array. Depending on what you want to do, you could avoid creating the array with something like this:
def each_permutation(string)
  string.split(//).permutation do |permutation|
    yield permutation.join
  end
end
Then use this method like this:
each_permutation(my_string) do |s|
  lookup_string(s) # or whatever you need to do for each string here
end
This doesn’t check for duplicates (no call to uniq), but avoids creating the array. This will still likely take quite a long time for large strings.
However I suspect in your case there is a better way of solving your problem.
I actually don't need to save the result anywhere I just need to lookup for each in a table to see if it exists.
It looks like you're looking for possible anagrams of a string in an existing word list. If you take any two anagrams and sort the characters in them, the resulting two strings will be the same. Could you perhaps change your data structures so that you have a hash whose keys are the sorted strings and whose values are lists of words that are anagrams of that key? Then, instead of checking all permutations of a new string against a list, you just sort the characters in the string and use that as the key to look up all words that are permutations of it.
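A minimal sketch of that idea, assuming word_list is your existing table of words:

# Build the index once: sorted characters => every anagram in the list.
anagrams = word_list.group_by { |word| word.chars.sort.join }

# Lookup is then a single hash access instead of up to n! permutation checks.
def lookup(anagrams, string)
  anagrams.fetch(string.chars.sort.join, [])
end

lookup(anagrams, 'listen') # => e.g. ["enlist", "listen", "silent"]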
Perhaps you don't need to generate all elements of the set, but rather only a random or constrained subset. I have written an algorithm to generate the m-th permutation in O(n) time.
First convert the key to a list representation of itself in the factorial number system. Then iteratively pull out the item of the old array at each index specified by the new list.
module Factorial
  def factorial num; (2..num).inject(:*) || 1; end

  # Returns [factorial, integer that generates it] for the largest
  # factorial that does not exceed num.
  def factorial_floor num
    tmp_1 = 0
    1.upto(1.0/0.0) do |counter|
      break [tmp_1, counter - 1] if (tmp_2 = factorial counter) > num
      tmp_1 = tmp_2
    end
  end
end

class Array
  include Factorial

  def generate_swap_list key
    swap_list = []
    key -= (swap_list << (factorial_floor key)).last[0] while key > 0
    swap_list
  end

  def reduce_swap_list swap_list
    swap_list = swap_list.map { |x| x[1] }
    ((length - 1).downto 0).map { |element| swap_list.count element }
  end

  def keyed_permute key
    apply_swaps reduce_swap_list generate_swap_list key
  end

  def apply_swaps swap_list
    swap_list.map { |index| delete_at index }
  end
end
Now, if you want to randomly sample some permutations, Ruby comes with Array#shuffle!, but this approach lets you copy and save permutations, or iterate through the permutohedral space. Or maybe there's a way to constrain the permutation space for your purposes.
constrained_generator_thing do |val|
  Array.new(sample_size) { array_to_permute.keyed_permute val }
end
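For a concrete call, a key of 5 on a four-element array looks like this (note that apply_swaps empties the receiver via delete_at, so permute a copy if you need the original):

[1, 2, 3, 4].keyed_permute(5) # => [1, 4, 3, 2]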
Perhaps I am missing the obvious, but why not do
['a','a','b'].permutation.to_a.uniq # uniq rather than uniq!, which returns nil when nothing was removed

How can I read a word list in chunks of 100?

I want to read words in chunk of 100 from a file and then process them.
I can do it by adding an extra counter etc., but is there a built-in method in one of the IO libs that does this? I wasn't able to find one.
require 'pp'

arr = []
i = 0
f = File.open("/home/pboob/Features/KB/178/synthetic/dataCreation/uniqEnglish.out").each(" ") { |word|
  i = i + 1
  arr << word
  if i == 100
    pp arr
    arr.clear
    i = 0
  end
}
pp arr
pp arr
Thanks!
P.S:
The file is too big to fit in memory, so I will have to use ".each "
Better than each, laziness with enumerable-lazy:
require 'enumerable/lazy'

result = open('/tmp/foo').lines.lazy.map(&:chomp).each_slice(100).map do |group_of_words|
  # f(group_of_words)
end
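As an aside, Enumerable#lazy has been built into Ruby since 2.0, so the same approach works without the gem; a sketch, assuming one word per line:

File.open('/tmp/foo') do |f|
  f.each_line.lazy.map(&:chomp).each_slice(100) do |group_of_words|
    # process group_of_words (an Array of up to 100 words)
  end
end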
Actually, I believe the implementation of "each_slice" is sufficiently lazy for your purposes. Try this:
open('/tmp/foo').lines.each_slice(100) do |lines|
  lines = lines.collect(&:chomp) # optional
  # do something with lines
end
Not as elegant as tokland's solution but it avoids adding an extra dependency to your app, which is always nice.
I think this might be useful to you:
http://blog.davidegrayson.com/2012/03/ruby-enumerable-module.html
Assuming one word per line, and the ability to slurp an entire file into memory:
IO.readlines('/tmp/foo').map(&:chomp).each_slice(100).to_a
If you are memory-constrained, then you can iterate in chunks by specifying only the chunk size; no counter required!
File.open('/tmp/foo') do |f|
  chunk = []
  f.each do |line|
    chunk.push(line)
    next unless f.eof? or chunk.size == 100
    puts chunk.inspect
    chunk.clear
  end
end
That's pretty verbose, though it does make it clear what's going on with the chunking. If you don't mind being less explicit, you can still use slicing with an Enumerator:
File.open('/tmp/foo').lines.map(&:chomp).each_slice(100) {|words| p words}
and replace the block with whatever processing you want to perform on each chunk.
Maybe it's more straightforward to do:
File.open(filename) do |file|
  do_things(100.times.map { file.gets ' ' }) until file.eof?
end

Equivalent to find_each in DataMapper

Is there a way to do the equivalent of ActiveRecord#find_each in DataMapper ?
(find_each will iterate over the result of a query by fetching things in memory by batch of 1000 rather than loading everything in memory)
I checked dm-chunked_query as @MichaelKohl suggested, but I couldn't make it work as I'd expect; it gets the whole collection (I'd expect it to use OFFSET + LIMIT). So I wrote my own extension. It's pretty simple; hope it helps:
class DataMapper::Collection
  def batch(n)
    Enumerator.new do |y|
      offset = 0
      loop do
        records = slice(offset, n)
        break if records.empty?
        records.each { |record| y.yield(record) }
        offset += records.size
      end
    end
  end
end
# Example
Model.all(:order => :id.asc).batch(1000).each { |obj| p obj }
I don't use DM much, but it would not be that hard to write your own, assuming DM lets you apply your own 'limit' and 'offset' manually to queries.
Check out the implementation of find_each/find_in_batches in AR, only a couple dozen lines.
https://github.com/rails/rails/blob/master/activerecord/lib/active_record/relation/batches.rb#L19
https://github.com/rails/rails/blob/master/activerecord/lib/active_record/relation/batches.rb#L48
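A minimal sketch of that manual approach, assuming DataMapper's :limit and :offset query options map to SQL LIMIT/OFFSET (the helper name and batch size are illustrative):

def find_in_batches(model, batch_size = 1000)
  offset = 0
  loop do
    batch = model.all(:order => :id.asc, :limit => batch_size, :offset => offset)
    break if batch.empty?
    yield batch
    offset += batch_size
  end
end

# Usage:
find_in_batches(User, 500) { |users| users.each { |u| p u } }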

Guidance with benchmarking some string splitting etc

I want to load an array with 1million guids, then loop through them and perform some string operations on each element of the array.
I only want to benchmark the time for the operations I perform on each element of the array, not the time it takes to initialize the array with 1 million rows.
I tried doing a benchmark before, but I didn't understand the output.
How would you do this? I have:
rows = []
(1..1000000).each do |x|
  rows[x] = "..." # some string data
end

n = 50000

Benchmark.bm do |x|
  rows.each do |r|
    # perform string logic here
  end
end
Will this return consistent results?
Any guidance/gotcha's I should know about?
Yes, this will return consistent results. You need to use report inside the benchmark, however, and (if processing a million rows is too fast) you will need to use your n variable to iterate a few times. (Start with a low n and increase it if your times are in the tenths or hundredths of a second.)
require 'benchmark'

# Prepare your test data here

n = 1
Benchmark.bm do |x|
  x.report('technique 1') do
    n.times do
      # perform your string logic here
    end
  end
  x.report('technique 2') do
    n.times do
      # perform your alternative logic here
    end
  end
end
Make sure you run your multiple comparisons in the same Benchmark block; don't write one attempt, write down the numbers, and then change the code to run it again. Not only is that more work for you, but it also may produce incorrect comparisons if your machine is in a different state (or if, heaven forfend, you run one test on one machine and another test on another machine).
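If warm-up effects worry you, the standard library also provides Benchmark.bmbm, which runs a rehearsal pass before the measured pass to reduce that kind of inconsistency; for example:

require 'benchmark'

n = 100_000
Benchmark.bmbm do |x|
  x.report('upcase')  { n.times { 'hello world'.upcase } }
  x.report('reverse') { n.times { 'hello world'.reverse } }
end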
