I have a lot of pictures, some of which are totally identical except for the file name. Currently I group them by calculating each picture's MD5, but hashing every file seems very slow. Is there an alternative way to make this faster? Would it help to resize the images before hashing?
You could group files by [filesize, partial hashcode], where "partial hashcode" is a hash of (say) a block of [N, filesize].min bytes from the file (e.g., at the beginning or end of the file). Naturally, the choice of N affects the probability of two different files being grouped together, but that might be acceptable if the probability and/or cost of an erroneous grouping is sufficiently small.
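For example, here is a sketch of that grouping (the value of N and the choice to hash the first N bytes are illustrative, not a recommendation):
require 'digest'

N = 64 * 1024 # hash at most the first 64 KB of each file

groups = Dir['/path/to/pic/folder/*'].group_by do |filename|
  size    = File.size(filename)
  partial = File.open(filename, 'rb') { |f| f.read([N, size].min) }
  [size, Digest::MD5.hexdigest(partial)]
end

# Files sharing a [size, partial hash] key are *probably* identical;
# a full hash (or byte-by-byte compare) within each group confirms it.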
MD5 the pics with multiple processes if you're using CRuby, or with multiple threads if you're using Rubinius or JRuby.
Multiprocess
workers = 4 # >= number of CPU cores
file_groups = Dir['/path/to/pic/folder/*'].each_with_index.group_by { |filename, i| i % workers }.values

file_groups.each do |group|
  fork do
    group.each do |filename, _|
      # MD5 the file
    end
  end
end
Process.waitall
Multithread
workers = 4 # >= number of CPU cores
file_groups = Dir['/path/to/pic/folder/*'].each_with_index.group_by { |filename, i| i % workers }.values

threads = file_groups.map do |group|
  Thread.new do
    group.each do |filename, _|
      # MD5 the file
    end
  end
end
threads.each(&:join)
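For reference, here is one way the threaded version might look with the hashing step filled in and the results collected into a shared Hash behind a Mutex (a sketch using the standard library's Digest; the variable names are illustrative):
require 'digest'

groups = Hash.new { |h, k| h[k] = [] } # digest => list of identical files
mutex  = Mutex.new

threads = file_groups.map do |group|
  Thread.new do
    group.each do |filename, _|
      digest = Digest::MD5.file(filename).hexdigest
      mutex.synchronize { groups[digest] << filename }
    end
  end
end
threads.each(&:join)

groups.each_value { |files| puts files.inspect if files.size > 1 }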
Related
Is it thread-safe to share a Ruby Hash between threads and modify it in each thread, with the guarantee that each thread modifies a different key (each thread appends a new hash whose number of keys isn't known before execution)?
I know it's not thread-safe if threads modify the same key; however, I'm not sure whether it's safe if they modify different keys.
e.g. below is an example program that might illustrate the problem:
#!/usr/bin/env ruby
# frozen_string_literal: true

array = [*1..100]
hash = {}

array.each do |element|
  hash[element] = {}
end

threads = []
array.each do |element|
  threads << Thread.new do
    random = rand(1..100)
    hash_new_keys = [*0..random]

    hash[element] = {}
    hash_new_keys.each do |key|
      hash[element][key] = rand(1..10)
    end
  end
end
threads.each(&:join)
If you use MRI then it's thread-safe to modify an array/hash from different threads, because the GIL guarantees that only one thread is active at a time.
Here's 5 threads sharing one Array object. Each thread pushes nil into the Array 1000 times:
array = []

5.times.map do
  Thread.new do
    1000.times do
      array << nil
    end
  end
end.each(&:join)
puts array.size
$ ruby pushing_nil.rb
5000
$ jruby pushing_nil.rb
4446
$ rbx pushing_nil.rb
3088
Because MRI has a GIL, even when there are 5 threads running at once,
only one thread is active at a time. In other words, things aren't
truly parallel. JRuby and Rubinius don't have a GIL, so when you have
5 threads running, you really have 5 threads running in parallel
across the available cores.
On the parallel Ruby implementations, the 5 threads are stepping
through code that's not thread-safe. They end up interrupting each
other and, ultimately, corrupting the underlying data.
Ref https://www.jstorimer.com/blogs/workingwithcode/8085491-nobody-understands-the-gil
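If you need the same code to behave correctly on JRuby or Rubinius as well, the usual fix is to serialise the shared writes with a Mutex. A minimal sketch of the nil-pushing example with a lock:
array = []
lock  = Mutex.new

5.times.map do
  Thread.new do
    1000.times do
      lock.synchronize { array << nil } # only one thread appends at a time
    end
  end
end.each(&:join)

puts array.size # => 5000 on MRI, JRuby and Rubinius alike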
In preparation for manipulation of a large chunk of data, I perform some cleaning and pruning operations prior to processing it. The code I have functions fine, but I'm worried about how these operations will scale up when I perform this on millions of points. The code I have feels inefficient but I don't know how to simplify my steps.
In my process, I parse a CSV file, check for garbage data (i.e. non-numerical values), typecast the remaining data to floats, and then sort it. I'm hoping for some guidance on how to improve this if possible.
require 'green_shoes'
require 'csv'

class String
  def valid_float?
    true if Float self rescue false
  end
end

puts "Parsing..."
temp_file = ask_open_file("")
temp_arr = CSV.read(temp_file)

temp_arr.each do |temp_row|
  temp_row.map! { |x| !x.valid_float? ? 0 : x }
  temp_row.map! { |x| x.to_f }
  temp_row.sort!
end
My guess is that you'd want to return the file contents when you're done, right? If so, you'll want to use map on temp_arr instead of each.
You can also save an iteration by combining the first two lines:
temp_arr.map! do |temp_row|
  temp_row.map! { |x| x.valid_float? ? x.to_f : 0.0 }
  temp_row.sort!
end
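If you're on Ruby 2.6 or newer (an assumption), you can also drop the String monkey patch entirely: Kernel#Float accepts exception: false and returns nil for garbage input, including the nil cells CSV can produce. A sketch:
temp_arr = CSV.read(temp_file).map do |row|
  row.map { |x| Float(x, exception: false) || 0.0 }.sort
end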
Our program creates a master hash where each key is a symbol representing an ID (about 10-20 characters) and each value is an empty hash.
The master hash has about 800K records.
Yet we're seeing Ruby's memory use hit almost 400MB.
This suggests each key/value pair (symbol + empty hash) consumes ~500 bytes.
Is this normal for Ruby?
Code below:
def load_app_ids
  cols = get_columns AppFile
  id_col = cols[:application_id]
  each_record AppFile do |r|
    @apps[r[id_col].intern] = {}
  end
end

# Takes a line, strips the record separator, and returns
# an array of fields
def split_line(line)
  line.gsub(RecordSeperator, "").split(FieldSeperator)
end

# Run a block on each record in a file, up to
# @limit records
def each_record(filename, &block)
  i = 0
  path = File.join(@dir, filename)
  File.open(path, "r").each_line(RecordSeperator) do |line|
    # Get the line split into columns unless it is
    # a comment
    block.call split_line(line) unless line =~ /^#/

    # This import can take a loooong time.
    print "\r#{i}" if (i += 1) % 1000 == 0
    break if @limit and i >= @limit
  end
  print "\n" if i > 1000
end

# Return map of column name symbols to column number
def get_columns(filename)
  path = File.join(@dir, filename)
  description = split_line(File.open(path, &:readline))

  # Strip the leading comment character
  description[0].gsub!(/^#/, "")

  # Return map of symbol to column number
  Hash[ description.map { |str| [ str.intern, description.index(str) ] } ]
end
I would say this is normal for Ruby. I don't have metrics for space used by each data structure, but in general basic Ruby works poorly on this kind of large structure. It has to allow for the fact that the keys and values can be any kind of object for instance, and although that is very flexible for high-level coding, it's inefficient when you don't need such arbitrary control.
If I do this in irb
h = {}
800000.times { |x| h[("test" + x.to_s).to_sym] = {} }
I get a process with 197 MB used.
Your process has claimed more space as it created large numbers of hashes during processing - one for each row. Ruby will eventually clean up - but that doesn't happen immediately, and the memory is not returned to the OS immediately either.
Edit: I should add that I have been working with large data structures of various kinds in Ruby - the general approach if you need them is to find something coded in native extensions (or ffi) where the code can take advantage of using restricted types in an array for example. The gem narray is a good example of this for numeric arrays, vectors, matrices etc.
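If you want to reproduce that measurement outside irb, one quick way is to ask the OS for the process's resident set size (a sketch assuming a Unix-like system where ps reports RSS in kilobytes):
h = {}
800_000.times { |x| h[("test" + x.to_s).to_sym] = {} }

rss_kb = `ps -o rss= -p #{Process.pid}`.to_i
puts "resident set size: #{rss_kb / 1024} MB"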
I want to load an array with 1 million GUIDs, then loop through them and perform some string operations on each element of the array.
I only want to benchmark the time for the operations I perform on each element of the array, not the time it takes to initialize the array with 1 million rows.
I tried doing a benchmark before, but I didn't understand the output.
How would you do this? I have:
rows = []
(1..1000000).each do |x|
  rows[x] = // some string data
end

n = 50000
Benchmark.bm do |x|
  rows.each do |r|
    # perform string logic here
  end
end
Will this return consistent results?
Any guidance/gotcha's I should know about?
Yes, this will return consistent results. You need to call report inside the Benchmark block, however, and (if processing a million rows is too fast) you will need to use your n variable to iterate a few times. (Start with a low n and increase it if your times are in the tenths or hundredths of a second.)
require 'benchmark'

# Prepare your test data here
n = 1
Benchmark.bm do |x|
  x.report('technique 1') do
    n.times do
      # perform your string logic here
    end
  end
  x.report('technique 2') do
    n.times do
      # perform your alternative logic here
    end
  end
end
Make sure you run your multiple comparisons in the same Benchmark block; don't write one attempt, write down the numbers, and then change the code to run it again. Not only is that more work for you, but it also may produce incorrect comparisons if your machine is in a different state (or if, heaven forfend, you run one test on one machine and another test on another machine).
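If the numbers still fluctuate between runs, Benchmark.bmbm is worth a look: it runs a rehearsal pass first, which reduces noise from allocation and GC warm-up. Usage is the same apart from the method name:
require 'benchmark'

Benchmark.bmbm do |x|
  x.report('technique 1') { n.times { } } # fill the inner block with your string logic
  x.report('technique 2') { n.times { } } # and your alternative logic here
end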
I'm definitely a newbie to Ruby (and using 1.9.1), so any help is appreciated. Everything I've learned about Ruby has been from using Google. I'm trying to compare two arrays of hashes, and due to the sizes it's taking way too long and flirts with running out of memory. Any help would be appreciated.
I have a Class (ParseCSV) with multiple methods (initialize, open, compare, strip, output).
The way I have it working right now is as follows (and this does pass the tests I've written, just using a much smaller data set):
file1 = ParseCSV.new("some_file")
file2 = ParseCSV.new("some_other_file")
file1.open # this reads the file contents into an Array of Hashes through the CSV library
file1.strip # this just removes extra hashes from each array index. Normally there are fifty hashes in each array index; this is done to help reduce memory consumption.
file2.open
file2.compare(file1.storage) # @storage is the array of hashes from the open method
file2.output
Now what I’m struggling with is the compare method. Working on smaller data sets it’s not a big deal at all, works fast enough. However in this case I’m comparing about 400,000 records (all read into the array of hashes) against one that has about 450,000 records. I’m trying to speed this up. Also I can’t run the strip method on file2. Here is how I’m doing it now:
def compare(x)
  # obviously just a verbose message
  puts "Comparing and leaving behind non matching entries"
  x.each do |row|
    # @storage is the array of hashes
    @storage.each_index do |y|
      if row[@opts[:field]] == @storage[y][@opts[:field]]
        @storage.delete_at(y)
      end
    end
  end
end
Hopefully that makes sense. I know it’s going to be a slow process just because it has to iterate 400,000 rows 440,000 times each. But do you have any other ideas on how to speed it up and possibly reduce memory consumption?
Yikes, that'll be O(n^2) runtime. Nasty.
A better bet would be to use the built in Set class.
Code would look something like:
require 'set'

file1_content = load_file_content_into_array_here("some_file")
file2_content = load_file_content_into_array_here("some_other_file")

file1_set = Set.new(file1_content)
unique_elements = file1_set - file2_content
That assumes that the files themselves have unique content. Should work in the generic case, but may have quirks depending on what your data looks like and how you parse it, but as long as the lines can be compared with == it should help you out.
Using a set will be MUCH faster than doing a nested loop to iterate over the file content.
(and yes, I have actually done this to process files with ~2 million lines, so it should be able to handle your case - eventually. If you're doing heavy data munging, Ruby may not be the best choice of tool though)
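If your rows are hashes and you only compare on a single field (as the compare method in the question does), the same idea still applies: build a Set of that field's values from one file and reject matching rows from the other. A sketch (the :field key and the storage reader are assumptions standing in for whatever your class actually uses):
require 'set'

file1_keys = file1.storage.map { |row| row[:field] }.to_set
file2.storage.reject! { |row| file1_keys.include?(row[:field]) }
# file2.storage now holds only the non-matching entries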
Here's a script comparing two ways of doing it: Your original compare() and a new_compare(). The new_compare uses more of the built in Enumerable methods. Since they are implemented in C, they'll be faster.
I created a constant called Test::SIZE to try out the benchmarks with different hash sizes. Results at the bottom. The difference is huge.
require 'benchmark'

class Test
  SIZE = 20000

  attr_accessor :storage

  def initialize
    file1 = []
    SIZE.times { |x| file1 << { :field => x, :foo => x } }

    @storage = file1
    @opts = {}
    @opts[:field] = :field
  end

  def compare(x)
    x.each do |row|
      @storage.each_index do |y|
        if row[@opts[:field]] == @storage[y][@opts[:field]]
          @storage.delete_at(y)
        end
      end
    end
  end

  def new_compare(other)
    other_keys = other.map { |x| x[@opts[:field]] }
    @storage.reject! { |s| other_keys.include? s[@opts[:field]] }
  end
end

storage2 = []

# We'll make 10 of them match
10.times { |x| storage2 << { :field => x, :foo => x } }

# And the rest won't
(Test::SIZE - 10).times { |x| storage2 << { :field => x + 100000000, :foo => x } }

Benchmark.bm do |b|
  b.report("original compare") do
    t1 = Test.new
    t1.compare(storage2)
  end
end

Benchmark.bm do |b|
  b.report("new compare") do
    t1 = Test.new
    t1.new_compare(storage2)
  end
end
Results:

Test::SIZE = 500
                        user     system      total        real
original compare    0.280000   0.000000   0.280000 (  0.285366)
                        user     system      total        real
new compare         0.020000   0.000000   0.020000 (  0.020458)

Test::SIZE = 1000
                        user     system      total        real
original compare   28.140000   0.110000  28.250000 ( 28.618907)
                        user     system      total        real
new compare         1.930000   0.010000   1.940000 (  1.956868)

Test::SIZE = 5000
                        user     system      total        real
original compare  113.100000   0.440000 113.540000 (115.041267)
                        user     system      total        real
new compare         7.680000   0.020000   7.700000 (  7.739120)

Test::SIZE = 10000
                        user     system      total        real
original compare  453.320000   1.760000 455.080000 (460.549246)
                        user     system      total        real
new compare        30.840000   0.110000  30.950000 ( 31.226218)
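One further note on new_compare: Array#include? is still a linear scan, so the whole comparison remains roughly O(n*m). Converting other_keys to a Set (as in the previous answer) makes each lookup roughly constant time, which should help further at the sizes in the question. A sketch:
require 'set'

def new_compare(other)
  other_keys = other.map { |x| x[@opts[:field]] }.to_set
  @storage.reject! { |s| other_keys.include? s[@opts[:field]] }
end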