How to handle thread returns in Ruby

I have a Ruby script in which I'm parsing a large CSV file. I have everything handled and working fairly well, except for how to deal with the threads' return values. I have:
length = (ses.size / 4).ceil
ses.each_slice(length) do |x|
  threads << Thread.new { a, b = splat x }
end

threads.each { |thr| thr.join }
splat returns two temp files that need to be appended to the output files out1 and out2. I'm stumbling on where exactly to do that, and how to get that information back out of the threads. If someone could point me in the right direction, that'd be great.

Two things. First, when you pass x into the thread, it's safer to make it thread-local by changing this:
threads << Thread.new { a,b = splat x }
Into this:
threads << Thread.new(x) { |x| a,b = splat x }
Next, to get the return value out, call Thread#value, which joins the thread and then returns whatever its block returned.
So here's a quick demo I whipped up:
dummy = [
  ['a.txt', 'b.txt'],
  ['c.txt', 'd.txt'],
  ['e.txt', 'f.txt'],
  ['g.txt', 'h.txt'],
  ['i.txt', 'j.txt'],
  ['k.txt', 'l.txt']
]

threads = dummy.map do |pair|
  Thread.new(pair) { |val| val }
end
vals = threads.map(&:value) # equiv. to 'threads.map { |t| t.value }'
puts vals.inspect
Crib off that and it should get you where you want to go.
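For instance, applied to your loop, a minimal sketch (assuming splat returns the two temp-file paths as a pair) could look like this; the append step at the end is just one way to do it:

threads = []
length = (ses.size / 4).ceil

ses.each_slice(length) do |x|
  # pass the slice in so each thread works on its own copy
  threads << Thread.new(x) { |slice| splat slice }  # block returns [a, b]
end

# value joins each thread and hands back what its block returned
threads.map(&:value).each do |a, b|
  File.open('out1', 'a') { |f| f.write File.read(a) }
  File.open('out2', 'a') { |f| f.write File.read(b) }
end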

Related

Accessing thread variables of threads stored in instance variables

This is expected,
t = Thread.new {
  Thread.current[:rabbit] = 'white'
}
t[:rabbit] # => "white"
But I can't understand this:
class Whatever
  def initialize
    @thd = Thread.new {
      Thread.current[:apple] = 'whatever'
    }
  end

  def apple
    @thd[:apple]
  end

  def thd
    @thd
  end
end
I want to access these, why are they nil?
Whatever.new.apple # nil
Whatever.new.thd[:apple] # nil
Whatever.new.thd.thread_variable_get(:apple) # nil
Why does this happen? How can I access @thd's thread variables?
What you're seeing here is a race condition. You're attempting to read the thread variable before the body of the thread has been run.
Compare the following:
w = Whatever.new
w.apple
# => nil
w = Whatever.new
sleep 0.1
w.apple
# => "whatever"
Whether or not the thread body gets run in time for Whatever.new.apple to see the value is essentially random; it happens about 0.1% of the time for me, though this will differ on other machines:
1000.times.
map { Whatever.new.apple }.
each_with_object(Hash.new(0)) { |val, memo| memo[val] += 1 }
# => {nil=>999, "whatever"=>1}
2000.times.
map { Whatever.new.apple }.
each_with_object(Hash.new(0)) { |val, memo| memo[val] += 1 }
# => {nil=>1998, "whatever"=>2}
(note: I cannot try with a higher number of iterations because the large amount of thread spawning causes my IRB to run out of resources)
This relates to what I've heard described as the "number one rule of async", namely that you can't get the return value of an asynchronous method from a synchronous one. The usual way to handle this is with a "callback", which Ruby can do in the form of yield / blocks.
I recommend looking for a tutorial about how to do asynchronous programming in Ruby.
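That said, a minimal sketch (not from the original answer) of the simplest fix here: join the thread before reading its thread-local variable, so the body is guaranteed to have run:

class Whatever
  def initialize
    @thd = Thread.new { Thread.current[:apple] = 'whatever' }
  end

  def apple
    @thd.join       # wait for the thread body to finish before reading
    @thd[:apple]
  end
end

Whatever.new.apple # => "whatever" every time, no race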

Why does Ruby Thread act up in this example - effectively missing files

I have a 50+ GB XML file which I initially tried to (man)handle with Nokogiri :)
Got killed: 9 - obviously :)
Now I'm into muddy Ruby threaded waters with this stab (at it):
#!/usr/bin/env ruby

def add_vehicle index, str
  IO.write "ess_#{index}.xml", str
  #file_name = "ess_#{index}.xml"
  #fd = File.new file_name, "w"
  #fd.write str
  #fd.close
  #puts file_name
end

begin
  record = []
  threads = []
  counter = 1
  file = File.new("../ess2.xml", "r")
  while (line = file.gets)
    case line
    when /<ns:Statistik/
      record = []
      record << line
    when /<\/ns:Statistik/
      record << line
      puts "file - %s" % counter
      threads << Thread.new { add_vehicle counter, record.join }
      counter += 1
    else
      record << line
    end
  end
  file.close
  threads.each { |thr| thr.join }
rescue => err
  puts "Exception: #{err}"
  err
end
Somehow this code 'skips' one or two files when writing the result files - hmmm!?
Okay, you have a problem: your file is huge, and you want to use multithreading. Now you have two problems.
On a more serious note, I've had very good experience with this code.
It parsed 20GB xml files with almost no memory use.
Download the mentioned code, save it as xml_parser.rb, and this script should work:
require_relative 'xml_parser.rb'

file = "../ess2.xml"

def add_vehicle index, str
  filename = "ess_#{index}.xml"
  File.open(filename, 'w+') { |out| out.puts str }
  puts format("%s has been written with %d lines", filename, str.each_line.count)
end

i = 0
Xml::Parser.new(Nokogiri::XML::Reader(open(file))) do
  for_element 'ns:Statistik' do
    i += 1
    add_vehicle(i, @node.outer_xml)
  end
end
#=> ess_1.xml has been written with 102 lines
#=> ess_2.xml has been written with 102 lines
#=> ...
It will take time, but it should work without error and without using much memory.
By the way, here is the reason why your code missed some files:
threads = []
counter = 1
threads << Thread.new { puts counter }
counter += 1
threads.each { |thr| thr.join }
#=> 2
threads = []
counter = 1
threads << Thread.new { puts counter }
sleep(1)
counter += 1
threads.each { |thr| thr.join }
#=> 1
counter += 1 often ran before the thread body had read counter, so add_vehicle was frequently called with the wrong counter. With many millions of nodes, some calls see the value offset by 0 and some by 1; when two add_vehicle calls end up with the same id, they overwrite each other and a file goes missing.
You have the same problem with record, with lines getting written in the wrong file.
Perhaps you should try to synchronize counter += 1 with a Mutex.
For example:
@lock = Mutex.new
@counter = 0

def add_vehicle str
  @lock.synchronize do
    @counter += 1
    IO.write "ess_#{@counter}.xml", str
  end
end
Mutex implements a simple semaphore that can be used to coordinate access to shared data from multiple concurrent threads.
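Another small sketch (mirroring the demo above) of a different fix: pass the current values into Thread.new as arguments, so the block parameters are copies local to that thread:

threads = []
counter = 1

# the value of counter is handed to the thread at creation time,
# so the later increment can no longer race with the block
threads << Thread.new(counter) { |c| puts c }
counter += 1
threads.each { |thr| thr.join }
#=> 1

In the original loop that would be Thread.new(counter, record.join) { |idx, str| add_vehicle idx, str }, which takes the race out of both counter and record.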
Or you can go another way from the start and use Ox. It is much faster than Nokogiri; take a look at a comparison. For huge files, Ox::Sax is the tool to reach for.
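Roughly, an Ox::Sax handler is a class whose callbacks (start_element, end_element, text, ...) Ox invokes as it streams the file, so memory use stays flat. A bare-bones sketch (untested; in particular, the exact symbol Ox reports for the namespaced element name is an assumption):

require 'ox'

class StatistikCounter < Ox::Sax
  attr_reader :count

  def initialize
    @count = 0
  end

  def start_element(name)
    # name arrives as a Symbol; the prefix handling here is an assumption
    @count += 1 if name == :'ns:Statistik'
  end
end

handler = StatistikCounter.new
File.open("../ess2.xml") { |f| Ox.sax_parse(handler, f) }
puts "found #{handler.count} ns:Statistik records"

Rebuilding each record's markup for the per-file output takes more bookkeeping than this, but the skeleton is the same.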

Ruby next multiple

Is there another way to write 'a'.next.next? I've looked all over and can't seem to find it.
I've tried multiplying the .next but I keep getting errors.
Well, this might not be a good idea in the case here, but if you're looking to chain a method n times in general, you can do something like this:
2.times.inject('a') { |s| s.next }
# => 'c'
20.times.inject('a') { |s| s.next }
# => 'u'
This starts with the value 'a', runs a block that calls next, then each successive result is fed back into the block.
For what it's worth, monkey-patching String can be fine for trivial scripts, but personally I'd try to look for other solutions first, like just adding a utility function to your class/module:
def repeat_next(str, n = 1)
  n.times.inject(str) { |s| s.next }
end
A shortcut for your specific problem, ('a'.ord + 2).chr, also exists, although it's not quite the same thing.
You can just redefine String#next like this:
class String
  alias_method :next1, :next

  def next(n = 1)
    str = self
    for i in 1..n
      str = str.next1
    end
    str
  end
end

puts 'a'.next
puts 'a'.next(2)
puts 'a'.next(20)
If you're looking for a more succinct way of doing this, you could use ('a'.ord + 2).chr. This converts 'a' to its numerical representation (with the ord method), increments it by two, then converts it back to the character representation (with chr).
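Breaking that one-liner into its steps, just to show the mechanics:

'a'.ord           # => 97, the character's integer codepoint
97 + 2            # => 99
99.chr            # => "c"
('a'.ord + 2).chr # => "c"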
You can monkey-patch the String class in ruby to add a method to do this for you:
class String
  def get_nth_char(n)
    current = self
    while n > 0 do
      current = current.next
      n = n - 1
    end
    current
  end
end
So you can do 'a'.get_nth_char(2) # => 'c'

Adding elements of different arrays together

I'm trying to use CSV to calculate the average of three numbers and output it to a separate file. Specifically: open one file, take the first value (the name), then calculate the average of the next three values, and do this for each person in the file.
Here is my Book1.csv
Tom,90,80,70
Adam,80,85,83
Mike,100,93,89
Dave,100,100,100
Rob,80,70,75
Nick,80,90,70
Justin,100,90,90
Jen,80,90,100
I'm trying to get it to output this:
Tom,80
Adam,83
Mike,94
Dave,100
Rob,75
Nick,80
Justin,93
Jen,90
I have each person in an array, and I thought I could get this to work with the basic "pseudo" code I had written, but it does not work.
Here is my code so far:
#!/usr/bin/ruby
require 'csv'

names = []
grades1 = []
grades2 = []
grades3 = []
average = []
i = 0

CSV.foreach('Book1.csv') do |students|
  names << students.values_at(0)
  grades1 << reader.values_at(1)
  grades2 << reader.values_at(2)
  grades3 << reader.values_at(3)
end

while i < 10 do
  average[i] = grades1[i] + grades2[i] + grades3[i]
  i = i + 1
end

CSV.open('Book2.csv', 'w') do |writer|
  rows.each { |record| writer << record }
end
The while loop part is the part that I am most concerned with. Any insight?
If you have an array of values that you want to sum, you can use:
sum = array.inject(:+)
If you change your data structure to:
grades = [ [], [], [] ]
...
grades[0] << reader.values_at(1)
Then you can do:
0.upto(9) do |i|
  average[i] = (0..2).map { |n| grades[n][i] }.inject(:+) / 3
end
There are a variety of ways to improve your data structures, the above being one of the least impactful to your code.
Any time you find yourself writing:
foo1 = ...
foo2 = ...
you should recognize it as a code smell and think about how you could organize your data into better collections.
Here's a rewrite of how I might do this. Notice that it works for any number of scores, not hardcoded to 3:
require 'csv'

averages = CSV.parse(DATA.read).map do |row|
  name, *grades = *row
  [ name, grades.map(&:to_i).inject(:+) / grades.length ]
end

puts averages.map(&:to_csv)
#=> Tom,80
#=> Adam,82
#=> Mike,94
#=> Dave,100
#=> Rob,75
#=> Nick,80
#=> Justin,93
#=> Jen,90
__END__
Tom,90,80,70
Adam,80,85,83
Mike,100,93,89
Dave,100,100,100
Rob,80,70,75
Nick,80,90,70
Justin,100,90,90
Jen,80,90,100

How do I limit the number of times a block is called?

In How do I limit the number of replacements when using gsub?, someone suggested the following way to do a limited number of substitutions:
str = 'aaaaaaaaaa'
count = 5
p str.gsub(/a/){if count.zero? then $& else count -= 1; 'x' end}
# => "xxxxxaaaaa"
It works, but the code mixes up how many times to substitute (5) with what the substitution should be ("x" if there should be a substitution, $& otherwise). Is it possible to separate the two out?
(If it's too hard to separate the two things out in this scenario, but it can be done in some other scenario, post that as an answer.)
How about just extracting the replacement as an argument and encapsulating the counter by having the block close over it inside a method?
str = "aaaaaaaaaaaaaaa"
def replacements(replacement, limit)
count = limit
lambda { |original| if count.zero? then original else count -= 1; replacement end }
end
p str.gsub(/a/, &replacements("x", 5))
You can make it even more general by using a block for the replacement:
def limit(n, &block)
  count = n
  lambda do |original|
    if count.zero? then original else count -= 1; block.call(original) end
  end
end
Now you can do stuff like
p str.gsub(/a/, &limit(5) { "x" })
p str.gsub(/a/, &limit(5, &:upcase))
gsub will call the block exactly as often as the regex matches the string. The only way to prevent that is to call break in the block, however that will also keep gsub from producing a meaningful return value.
So no, unless you call break in the block (which prevents any further code in the yielding method from running and thus prevents the method from returning anything meaningful), the number of times a method calls a block is determined solely by the method itself. If you want gsub to yield only 5 times, the only way to do that is to pass in a regex that only matches the given string five times.
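You can watch the yield count directly; a quick sketch:

yields = 0
'aaaaaaaaaa'.gsub(/a/) { |match| yields += 1; match }
yields # => 10, one yield per match, regardless of what the block returns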
Why are you using gsub? It's designed to replace all occurrences of something, so right off the bat you're fighting it.
Use sub instead:
str = 'aaaaaaaaaa'
count = 5
count.times { str.sub!(/a/, 'x') }
p str
# >> "xxxxxaaaaa"
str = 'mississippi'
2.times { str.sub!(/s/, '5') }
2.times { str.sub!(/s/, 'S') }
2.times { str.sub!(/i/, '1') }
p str
# >> "m1551SSippi"
