Guidance with benchmarking some string splitting etc. - Ruby

I want to load an array with 1 million GUIDs, then loop through them and perform some string operations on each element of the array.
I only want to benchmark the time for the operations I perform on each element of the array, not the time it takes to initialize the array with 1 million rows.
I tried doing a benchmark before, but I didn't understand the output.
How would you do this? I have:
rows = []
(1..1000000).each do |x|
  rows[x] = "..." # some string data
end

n = 50000
Benchmark.bm do |x|
  rows.each do |r|
    # perform string logic here
  end
end
Will this return consistent results?
Any guidance/gotchas I should know about?

Yes, this will return consistent results. You do need to call report inside the benchmark, however, and (if processing a million rows is too fast) you will need to use your n variable to iterate a few times. (Start with a low n and increase it if your times are in the tenths or hundredths of a second.)
require 'benchmark'

# Prepare your test data here
n = 1
Benchmark.bm do |x|
  x.report('technique 1') do
    n.times do
      # perform your string logic here
    end
  end
  x.report('technique 2') do
    n.times do
      # perform your alternative logic here
    end
  end
end
Make sure you run your multiple comparisons in the same Benchmark block; don't run one attempt, write down the numbers, and then change the code and run it again. Not only is that more work for you, but it may also produce incorrect comparisons if your machine is in a different state (or if, heaven forfend, you run one test on one machine and another test on another machine).
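For instance, here is a minimal, self-contained sketch comparing two ways of splitting GUID-like strings; the sample data, labels, and the two techniques are my assumptions, not from the question:
require 'benchmark'
require 'securerandom'

# Assumed sample data: 100,000 random GUIDs (scaled down from 1 million
# so the example runs quickly).
rows = Array.new(100_000) { SecureRandom.uuid }

n = 10
Benchmark.bm(15) do |x|
  x.report('String#split') do
    n.times { rows.each { |r| r.split('-') } }
  end
  x.report('regex split') do
    n.times { rows.each { |r| r.split(/-/) } }
  end
end
Benchmark.bmbm is also worth a look: it runs a rehearsal pass first, so GC and warm-up costs don't skew the reported times.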

Related

How to test a loop with shuffle in Ruby?

How do I test this in Ruby?
if shuffle
  some_array.shuffle.each { |val| puts "#{val}" }
end
Do I need to test shuffle or not, since it is a Ruby method? Thanks.
Short answer: No.
You can trust that Ruby will do things correctly. It has a huge number of tests already.
Long answer: Yes.
You shouldn't be testing the shuffle method directly, but testing that your code produces the correct results.
Since your code uses puts, it is very annoying to test. If you can write a method that returns values that can be printed, that's usually a lot better. When writing code, always think about how you can test it.
If you're struggling with that, where the way to test something isn't clear, write the tests first and then write code to make them pass.
If it's imperative that your values be shuffled, then you'll need to come up with a way of determining if they're sufficiently shuffled. This can be difficult, since randomness is a fickle thing. There's a small but non-zero chance that shuffle does nothing to your data; that's how randomness works. This probability grows considerably the smaller your list is, to the point where a one-element list is guaranteed to come back unchanged.
So if you can describe why the data should be shuffled, and what constitutes a good shuffling, then you can write a test for this.
Here's an example of how to do that:
gem 'test-unit'
require 'test/unit'

class MyShuffler
  def initialize(data)
    @data = data
  end

  def processed
    @data.map do |e|
      e.downcase
    end.shuffle
  end
end
Now you can use it like this:
shuffler = MyShuffler.new(%w[ a b c d e f ])

# Thin presentation layer here where we're just displaying each
# element. Test code for this is not strictly necessary.
shuffler.processed.each do |e|
  puts e
end
Now you write test code for the data manipulation in isolation, not the presentation part:
gem 'test-unit'
require 'test/unit'

class MyShufflerTest < Test::Unit::TestCase
  def test_processed
    shuffler = MyShuffler.new(%w[ A B c Dee e f Gee ])

    results = shuffler.processed
    expected = %w[ a b c dee e f gee ]

    assert_equal expected, results.sort
    assert_not_equal expected, results

    counts = Hash.new(0)
    iterations = 100000

    # Keep track of the number of times a particular element appears in
    # the first entry of the array.
    iterations.times do
      counts[shuffler.processed[0]] += 1
    end

    expected_count = iterations / expected.length

    # The count for any given element should be within 5% of the total
    # number of iterations of the expected count. The variance generally
    # decreases with a larger number of iterations.
    expected.each do |e|
      assert (counts[e] - expected_count).abs < iterations * 0.05
    end
  end
end
You should not test randomness using a unit test. A unit test should call a method and test the returned value (or object state) against an expected value. The problem with testing randomness is that there isn't an expected value for most of the things you'd like to test.
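If you do want a deterministic test, one workaround (my suggestion, not part of either answer) is to inject a seeded random number generator, since Array#shuffle accepts a random: keyword:
# A sketch: with a fixed seed, the "random" order is reproducible.
class MyShuffler
  def initialize(data, rng: Random.new)
    @data = data
    @rng  = rng
  end

  def processed
    @data.map(&:downcase).shuffle(random: @rng)
  end
end

MyShuffler.new(%w[ A B c ], rng: Random.new(42)).processed
# => the same order on every run with the same seed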

Ruby Array Operations: Increasing Efficiency

In preparation for manipulating a large chunk of data, I perform some cleaning and pruning operations prior to processing it. The code I have functions fine, but I'm worried about how these operations will scale up when I perform them on millions of points. The code I have feels inefficient, but I don't know how to simplify my steps.
In my process, I parse in a CSV file, check for garbage data (i.e. non-numerical values), typecast the remaining data to floats, and then sort it. I'm hoping for some guidance on how to improve this if possible.
require 'green_shoes'
require 'csv'

class String
  def valid_float?
    true if Float self rescue false
  end
end

puts "Parsing..."
temp_file = ask_open_file("")
temp_arr = CSV.read(temp_file)
temp_arr.each do |temp_row|
  temp_row.map!{ |x| !x.valid_float? ? 0 : x }
  temp_row.map!{ |x| x.to_f }
  temp_row.sort!
end
My guess is that you'd want to return the file contents when you're done, right? If so, you'll want to use map on temp_arr instead of each.
You can also save an iteration by combining the first two lines:
temp_arr.map! do |temp_row|
  temp_row.map!{ |x| x.valid_float? ? x.to_f : 0.0 }
  temp_row.sort!
end

How to identify the same image in a faster way

I have a lot of pictures, some of which are totally identical except for the file name. Currently I group them by calculating each pic's MD5, but it seems very slow to hash each of them. Is there any alternative way to make it faster? Would it help to resize the images before hashing?
You could group files by [filesize, partial hashcode], where the "partial hashcode" is a hash of (say) some block of [N, filesize].min bytes in the file (e.g., at the beginning or end of the file). Naturally, the choice of N affects the probability of two different files being grouped together, but that might be acceptable if the probability and/or cost of creating an erroneous grouping is sufficiently small.
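A minimal sketch of that idea; the folder path, the block size N, and reading from the start of each file are all assumptions:
require 'digest'

N = 4096 # bytes of each file to hash

files = Dir['/path/to/pic/folder/*'].select { |f| File.file?(f) }
groups = files.group_by do |f|
  size = File.size(f)
  [size, Digest::MD5.hexdigest(File.binread(f, [N, size].min))]
end

# Only files that land in the same group can be identical; hash those
# few in full (or compare them byte by byte) to confirm.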
MD5 the pics with multiple processes if you're using CRuby, or with multiple threads if you're using Rubinius or JRuby.
Multiprocess
workers = 4 # >= number of CPU cores
file_groups = Dir['/path/to/pic/folder/*'].each_with_index.group_by { |filename, i| i % workers }.values

file_groups.each do |group|
  fork do
    group.each do |filename, _|
      # MD5 the file
    end
  end
end

Process.waitall
Multithread
workers = 4 # >= number of CPU cores
file_groups = Dir['/path/to/pic/folder/*'].each_with_index.group_by { |filename, i| i % workers }.values

threads = file_groups.map do |group|
  Thread.new do
    group.each do |filename, _|
      # MD5 the file
    end
  end
end

threads.each(&:join)

Why can't I get Array#slice to work the way I want?

I have a big Array of AR model instances. Let's say there are 20K entries in the array. I want to move through that array in chunks of 1,000 at a time.
slice_size = 1000
start = 0
myarray.slice(start, slice_size) do |slice|
  slice.each do |item|
    item.dostuff
  end
  start += slice_size
end
I can replace that whole inner block with just:
puts "hey"
and not see a thing in the console. I have tried this 9 ways from Sunday. And I've done it successfully before, just can't remember where. And I have RTFM. Can anyone help?
The problem is that slice does not take a block; you are passing it one and trying to do something in it, but the block is ignored. If you do
myarray.slice(start, slice_size).each do |slice|
  ...
end
then it should work.
But to do it that way is not Ruby-ish. A better way is
myarray.each_slice(slice_size) do |slice|
  ...
end
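For example (sample data assumed):
(1..10).each_slice(3) { |slice| p slice }
# prints:
# [1, 2, 3]
# [4, 5, 6]
# [7, 8, 9]
# [10]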
If the array can be destroyed, you could do it like this:
((myarray.size + slice_size - 1) / slice_size).times.map { myarray.shift(slice_size) }
If not:
((myarray.size + slice_size - 1) / slice_size).times.map { |i| myarray.slice(i * slice_size, slice_size) }
You can use:
- Enumerable#each_slice(n), which takes n items at a time;
- Array#in_groups_of(n) (if this is Rails), which works like each_slice but will pad the last group to guarantee the group size remains constant.
But I recommend using ActiveRecord's built-in Model.find_each, which batches the queries in the DB layer for better performance. It defaults to batches of 1000, but you can specify the batch size. See http://guides.rubyonrails.org/active_record_querying.html#retrieving-multiple-objects-in-batches for more detail.
Example from the guide:
User.find_each(batch_size: 5000) do |user|
  NewsLetter.weekly_deliver(user)
end

How to use less memory generating Array permutation?

So I need to get all possible permutations of a string.
What I have now is this:
def uniq_permutations string
  string.split(//).permutation.map(&:join).uniq
end
OK, now what is my problem: this method works fine for small strings, but I want to be able to use it with strings of size 15 or maybe even 20. With this method it uses a lot of memory (>1 GB), so my question is: what could I change so that it doesn't use that much memory?
Is there a better way to generate permutations? Should I persist them to the filesystem and retrieve them when I need them (I hope not, because that might make my method slow)?
What can I do?
Update:
I actually don't need to save the results anywhere; I just need to look up each one in a table to see if it exists.
Just to reiterate what Sawa said: do you understand the scope? The number of permutations of n elements is n!, which is about the most aggressive growth you can get from a simple mathematical operation. The results for n from 1 to 20 are:
[1, 2, 6, 24, 120, 720, 5040, 40320, 362880, 3628800, 39916800, 479001600,
6227020800, 87178291200, 1307674368000, 20922789888000, 355687428096000,
6402373705728000, 121645100408832000, 2432902008176640000]
The last number is approximately 2 quintillion, which is 2 billion billion.
Even at one byte per permutation, that is about 2,265,820,000 gigabytes.
You can save the results to disk all day long - unless you own all the Google datacenters in the world you're going to be pretty much out of luck here :)
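(For the curious, the table above can be reproduced with a one-liner; (2..n).inject(:*) returns nil for n = 1, hence the || 1.)
p (1..20).map { |n| (2..n).inject(:*) || 1 }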
Your call to map(&:join) is what is creating the array in memory, as map in effect turns an Enumerator into an array. Depending on what you want to do, you could avoid creating the array with something like this:
def each_permutation(string)
  string.split(//).permutation do |permutation|
    yield permutation.join
  end
end
Then use this method like this:
each_permutation(my_string) do |s|
  lookup_string(s) # or whatever you need to do for each string here
end
This doesn’t check for duplicates (no call to uniq), but avoids creating the array. This will still likely take quite a long time for large strings.
However I suspect in your case there is a better way of solving your problem.
I actually don't need to save the results anywhere; I just need to look up each one in a table to see if it exists.
It looks like you're looking for possible anagrams of a string in an existing word list. If you take any two anagrams and sort the characters in them, the resulting two strings will be the same. Could you perhaps change your data structures so that you have a hash whose keys are the sorted strings and whose values are lists of words that are anagrams of that string? Then, instead of checking all permutations of a new string against a list, you just sort the characters in the string and use that as the key to look up the list of all words that are permutations of it.
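A minimal sketch of that lookup; word_list stands in for whatever your table of existing words is:
# Build the index once: key = sorted characters, value = anagrams.
anagrams = word_list.group_by { |w| w.chars.sort.join }

# Checking a new string is then one hash access, no permutations at all.
query = 'tea'
anagrams.fetch(query.chars.sort.join, [])
# => all words in word_list that are anagrams of "tea"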
Perhaps you don't need to generate all elements of the set, but rather only a random or constrained subset. I have written an algorithm to generate the m-th permutation in O(n) time.
First, convert the key to its representation in the factorial number system. Then iteratively pull the item at each index specified by that representation out of the old array.
module Factorial
  def factorial num; (2..num).inject(:*) || 1; end

  # Returns [factorial, integer that generates it] for the largest
  # factorial that does not exceed num.
  def factorial_floor num
    tmp_1 = 0
    1.upto(1.0/0.0) do |counter|
      break [tmp_1, counter - 1] if (tmp_2 = factorial counter) > num
      tmp_1 = tmp_2
    end
  end
end

class Array
  include Factorial

  def generate_swap_list key
    swap_list = []
    key -= (swap_list << (factorial_floor key)).last[0] while key > 0
    swap_list
  end

  def reduce_swap_list swap_list
    swap_list = swap_list.map { |x| x[1] }
    ((length - 1).downto 0).map { |element| swap_list.count element }
  end

  def keyed_permute key
    apply_swaps reduce_swap_list generate_swap_list key
  end

  def apply_swaps swap_list
    swap_list.map { |index| delete_at index }
  end
end
Now, if you just want to randomly sample some permutations, Ruby comes with Array#shuffle!, but this approach lets you copy and save permutations, or iterate through the permutohedral space. Or maybe there's a way to constrain the permutation space for your purposes.
constrained_generator_thing do |val|
  Array.new(sample_size) { array_to_permute.keyed_permute val }
end
Perhaps I am missing the obvious, but why not do
['a','a','b'].permutation.to_a.uniq!
