Why is delete_at not removing elements at the supplied index? - ruby

I have a Worker and Job example, where each Job has an expensive/slow perform method.
If I have 10 Jobs in my @job_table I'd like to work them off in batches of 5, each within their own process.
After the 5 processes (one batch) have exited, I'm trying to remove those Jobs from the @job_table with delete_at.
I'm observing something unexpected in my implementation (see code below) though:
jobs:
[#<Job:0x007fd2230082a8 @id=0>,
 #<Job:0x007fd223008280 @id=1>,
 #<Job:0x007fd223008258 @id=2>,
 #<Job:0x007fd223008208 @id=3>,
 #<Job:0x007fd2230081e0 @id=4>,
 #<Job:0x007fd2230081b8 @id=5>,
 #<Job:0x007fd223008190 @id=6>,
 #<Job:0x007fd223008168 @id=7>,
 #<Job:0x007fd223008140 @id=8>,
 #<Job:0x007fd223008118 @id=9>]
This is the @job_table before the first batch is run. I see that Jobs 0-4 have run and exited successfully (omitted output here).
So I'm calling remove_batch_1 and would expect Jobs 0-4 to be removed from the @job_table, but this is what I'm observing instead:
jobs:
[#<Job:0x007fd223008280 @id=1>,
 #<Job:0x007fd223008208 @id=3>,
 #<Job:0x007fd2230081b8 @id=5>,
 #<Job:0x007fd223008168 @id=7>,
 #<Job:0x007fd223008118 @id=9>]
I've logged the i parameter in the method and it takes the values 0-4, but it looks like delete_at is removing different jobs (0, 2, 4, 6, 8).
I also wrote another method for removing a batch, remove_batch_0, which uses slice! and behaves as expected.
BATCH_SIZE = 5 || ENV['BATCH_SIZE']

class Job
  def initialize(id)
    @id = id
  end

  def perform
    puts "Job #{@id}> Start!"
    sleep 1
    puts "Job #{@id}> End!"
  end
end

class Worker
  def initialize
    @job_table = []

    fill_job_table
    work_job_table
  end

  def fill_job_table
    10.times do |i|
      @job_table << Job.new(i)
    end
  end

  def work_job_table
    until @job_table.empty?
      puts "jobs: "
      pp @job_table

      work_batch
      Process.waitall
      remove_batch_1
    end
  end

  def work_batch
    i = 0
    while (i < @job_table.length && i < BATCH_SIZE)
      fork { @job_table[i].perform }
      i += 1
    end
  end

  def remove_batch_1
    i = 0
    while (i < @job_table.length && i < BATCH_SIZE)
      @job_table.delete_at(i)
      i += 1
    end
  end

  def remove_batch_0
    @job_table.slice!(0..BATCH_SIZE-1)
  end
end

Worker.new

You use delete_at in a while loop with an increasing index. Let's see what happens:
Imagine you have the array [0,1,2,3,4,5] and you call:
(0..2).each { |i| array.delete_at(i) }
In the first iteration you delete the element at index 0; afterwards the array looks like this: [1,2,3,4,5]. In the next iteration you delete the element at index 1, which leads to [1,3,4,5]. Then you delete the element at index 2: [1,3,5]. Every deletion shifts the remaining elements one slot to the left, so an increasing index skips every other element.
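You can watch the indices drift in irb:
array = [0, 1, 2, 3, 4, 5]
array.delete_at(0) #=> 0; array is now [1, 2, 3, 4, 5]
array.delete_at(1) #=> 2; array is now [1, 3, 4, 5]
array.delete_at(2) #=> 4; array is now [1, 3, 5]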
You might want to use Array#shift instead:
def remove_batch_1
  @job_table.shift(BATCH_SIZE)
end
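Array#shift(n) removes the first n elements in place and returns them, which matches the "drop the finished batch" semantics exactly:
jobs = (0..9).to_a
jobs.shift(5) #=> [0, 1, 2, 3, 4]
jobs          #=> [5, 6, 7, 8, 9]
It also copes with a short final batch: shifting 5 elements from a 3-element array simply returns the remaining 3.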

Related

why does Ruby Thread act up in this example - effectively missing files

I have a 50+ GB XML file which I initially tried to (man)handle with Nokogiri :)
Got killed: 9 - obviously :)
Now I'm into muddy Ruby threaded waters with this stab (at it):
#!/usr/bin/env ruby

def add_vehicle index, str
  IO.write "ess_#{index}.xml", str
  #file_name = "ess_#{index}.xml"
  #fd = File.new file_name, "w"
  #fd.write str
  #fd.close
  #puts file_name
end

begin
  record  = []
  threads = []
  counter = 1
  file = File.new("../ess2.xml", "r")
  while (line = file.gets)
    case line
    when /<ns:Statistik/
      record = []
      record << line
    when /<\/ns:Statistik/
      record << line
      puts "file - %s" % counter
      threads << Thread.new { add_vehicle counter, record.join }
      counter += 1
    else
      record << line
    end
  end
  file.close
  threads.each { |thr| thr.join }
rescue => err
  puts "Exception: #{err}"
  err
end
Somehow this code 'skips' one or two files when writing the result files - hmmm!?
Okay, you have a problem because your file is huge, and you want to use multithreading.
Now you have two problems.
On a more serious note, I've had very good experience with this code.
It parsed 20 GB XML files with almost no memory use.
Download the mentioned code, save it as xml_parser.rb, and this script should work:
require_relative 'xml_parser.rb'

file = "../ess2.xml"

def add_vehicle index, str
  filename = "ess_#{index}.xml"
  File.open(filename, 'w+') { |out| out.puts str }
  puts format("%s has been written with %d lines", filename, str.each_line.count)
end

i = 0
Xml::Parser.new(Nokogiri::XML::Reader(open(file))) do
  for_element 'ns:Statistik' do
    i += 1
    add_vehicle(i, @node.outer_xml)
  end
end
#=> ess_1.xml has been written with 102 lines
#=> ess_2.xml has been written with 102 lines
#=> ...
It will take time, but it should work without error and without using much memory.
By the way, here is the reason why your code missed some files:

threads = []
counter = 1
threads << Thread.new { puts counter }
counter += 1
threads.each { |thr| thr.join }
#=> 2

threads = []
counter = 1
threads << Thread.new { puts counter }
sleep(1)
counter += 1
threads.each { |thr| thr.join }
#=> 1
counter += 1 was often faster than the add_vehicle call, so add_vehicle was frequently called with the wrong counter. With many millions of nodes, some threads see an offset of 0, some an offset of 1. When two add_vehicle calls are made with the same id, they overwrite each other, and a file goes missing.
You have the same problem with record, with lines getting written to the wrong file.
Perhaps you should try to synchronize counter += 1 with a Mutex.
For example:
@lock = Mutex.new
@counter = 0

def add_vehicle str
  @lock.synchronize do
    @counter += 1
    IO.write "ess_#{@counter}.xml", str
  end
end
Mutex implements a simple semaphore that can be used to coordinate access to shared data from multiple concurrent threads.
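Alternatively, you can avoid sharing counter and record across threads entirely by passing their current values as arguments to Thread.new; the block parameters are then local to each thread (a sketch of the relevant lines from the question's loop):
threads << Thread.new(counter, record.join) do |index, str|
  add_vehicle index, str
end
counter += 1
Thread.new evaluates its arguments in the parent thread at spawn time, so each child sees the values that were current when it was created.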
Or you can go another way from the start and use Ox. It is much faster than Nokogiri; take a look at a comparison. For huge files, use the SAX-style Ox::Sax parser.
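A minimal Ox::Sax sketch, just to show the shape of the API (the counting logic is an assumption based on the question; writing out each record would go in the callbacks):
require 'ox'

# Streams the document and counts ns:Statistik elements
# without building a DOM in memory.
class StatistikHandler < ::Ox::Sax
  attr_reader :count

  def initialize
    @count = 0
  end

  # Ox passes the element name as a Symbol for every opening tag.
  def start_element(name)
    @count += 1 if name == :'ns:Statistik'
  end
end

handler = StatistikHandler.new
File.open('../ess2.xml') { |io| Ox.sax_parse(handler, io) }
puts "#{handler.count} records found"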

Is it possible to call multiple methods or objects at once?

Say I have a method that includes a counter that outputs its count to the screen on every tick.
Elsewhere in the program, a new version of this method is called, so they both/all run at once, have different counters, and update together with the tick. Is it possible to do this with Ruby? Normally creating another instance of an object is what I would do; I am still new to Ruby though and getting the hang of it.
I will edit with sample code of what I am trying to achieve later. I'm currently on a mobile without access to a computer.
Here I'm creating two instances of Counter, both initially set to 0. Then I launch them 3 seconds apart, each in its own thread. They start to print out numbers.
class Counter
  def initialize
    @counter = 0 # initialize the counter to 0
  end

  def run
    loop do
      # wait one second, print the counter and increase it
      sleep 1
      puts @counter
      @counter += 1
    end
  end
end

threads = []

2.times do
  # put each counter in a separate thread
  threads << Thread.new do
    counter = Counter.new
    counter.run
  end
  sleep 3 # make a pause between launching counters
end

threads.each(&:join)
Output I get:
0 # first
1 # first
2 # first
0 # second
3 # first
1 # second
4 # first
2 # second
5 # first
The only trick here is to use the Thread class; otherwise the second counter would never start, since the first counter's infinite loop would block the whole process.
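To see why, note that run never returns; without threads, the second call would never be reached:
first = Counter.new
first.run       # loops forever, so...
Counter.new.run # ...this line is never executed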
You could use a queue and an external loop, something like:
class Counter
  def initialize(start)
    @count = start
  end

  def tick
    @count += 1
    puts @count
  end
end

queue = []
queue << Counter.new(0)
queue << Counter.new(100)

5.times do |i|
  puts "--- tick #{i} ---"
  queue.each(&:tick)
  sleep 1
end
Output:
--- tick 0 ---
1
101
--- tick 1 ---
2
102
--- tick 2 ---
3
103
--- tick 3 ---
4
104
--- tick 4 ---
5
105
Within the 5.times loop, tick is sent to each item in the queue. Note that the methods are called in the order the counters were added to the queue, i.e. they are not called simultaneously.
For your purpose you could use an event loop, processes, or threads, because in the common case Ruby blocks while a method is executing (until it returns control with return).
class ThreadCounter
  def run
    @thread ||= Thread.new do
      i = 0
      while !@stop do
        puts i += 1
        sleep(1)
      end
      @stop = nil
    end
  end

  def stop
    @stop = true
    @thread && @thread.join
  end
end

counter1 = ThreadCounter.new
counter2 = ThreadCounter.new

counter1.run
counter2.run

# wait some time

counter1.stop
counter2.stop

Ruby Put Periodic Progress Messages While Mapping

I am mapping an array of items, but the collection can be quite large. I would like to put a message to the console every so often, to give an indication of progress. Is there a way to do that during the mapping process?
This is my map statement:
famgui = family_items.map { |i| i.getGuid }
I have a def that I use for giving an update when I am doing a for each or while loop.
This is the def:
def doneloop(saymyname, i)
  if (i % 25000 == 0)
    puts "#{i} #{saymyname}"
  end
end
I normally put x = 0 before I start the loop, then x += 1 once I am inside the loop, and at the end of the loop I set saymyname = "specific type items gathered at #{Time.now}" and call doneloop(saymyname, x).
I am not sure how to do that when I am mapping, as there is no explicit loop to construct this around. Does anyone have a method to give updates when using map?
Thanks!
You can map with index (note that getGuid has to be the last expression in the block, since map collects the block's return value):
famgui = family_items.map.with_index { |item, index| doneloop('sth', index); item.getGuid }
Only the last expression is returned from a map, so you can do something like:
famgui = family_items.map.with_index do |i, idx|
  if idx % 100 == 0
    puts # extra linefeed
    # report every 100th round
    puts "items left: #{family_items.size - idx}"
    STDOUT.flush
  end
  print "."
  STDOUT.flush
  i.getGuid
end
This will print "." for each item and a status report after every 100 items.
If you want, you can use each_with_index and populate the array yourself:
famgui = []
family_items.each_with_index do |i, idx|
  famgui << i.getGuid
  puts "just did: #{idx} of #{family_items.size}"
end
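If you do this in several places, you can pull the progress reporting into a small helper (with_progress is a sketch, not an existing method; family_items and getGuid are the question's names):
# Maps over enum while printing progress every `step` items.
def with_progress(enum, step: 25000)
  enum.each_with_index.map do |item, idx|
    puts "#{idx} of #{enum.size} done" if idx > 0 && idx % step == 0
    yield item
  end
end

famgui = with_progress(family_items) { |i| i.getGuid }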

unexpected ruby global array variable behaviour

I have this code:
require 'pp'

$pool = []

def work arg
  $pool.push($$)
  sleep(1 + rand(5))
  $pool.delete($$)
  exit
end

ary = []
100.times { |x| ary.push(x) }

while ary.any? do
  while $pool.size < 10 && ary.any?
    arg = ary.pop
    Process.detach(fork { work(arg) })
    pp $pool.size
    sleep 0.01
  end
end
What's unexpected to me: although I am filling $pool in work(), pp $pool.size in the inner while loop always prints 0.
Where am I wrong?
When you create a new process with fork, the new process gets a copy of the (empty) $pool. The $pool in the parent process, which executes the loop, is never populated with anything (the call to the work function happens in the child process), so its size is always 0.
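A minimal demonstration of the copy-on-fork behaviour:
$pool = []
fork { $pool.push($$) } # mutates the child's copy of $pool only
Process.wait
p $pool #=> [] in the parent
If you need the children to report back, you have to use some form of inter-process communication (a pipe, a file, a socket) instead of a shared variable.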

Reading a file N lines at a time in ruby

I have a large file (hundreds of megs) that consists of filenames, one per line.
I need to loop through the list of filenames, and fork off a process for each filename. I want a maximum of 8 forked processes at a time and I don't want to read the whole filename list into RAM at once.
I'm not even sure where to begin, can anyone help me out?
File.foreach("large_file").each_slice(8) do |eight_lines|
# eight_lines is an array containing 8 lines.
# at this point you can iterate over these filenames
# and spawn off your processes/threads
end
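Putting the pieces together (a sketch; process is a stand-in for whatever work you fork off per filename):
File.foreach("large_file").each_slice(8) do |batch|
  batch.each do |filename|
    fork { process(filename.chomp) } # hypothetical per-file work
  end
  Process.waitall # wait for this whole batch of up to 8 children
end
Note this waits for an entire batch before starting the next one; the Process.wait approach below keeps up to 8 children busy continuously.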
It sounds like the Process module will be useful for this task. Here's something I quickly threw together as a starting point:
include Process

i = 0
for line in open('files.txt') do
  i += 1
  fork { `sleep #{rand} && echo "#{i} - #{line.chomp}" >> numbers.txt` }
  if i >= 8
    wait # join any single child process
    i -= 1
  end
end
waitall # join all remaining child processes
Example: suppose files.txt contains:
hello
goodbye
test1
test2
a
b
c
d
e
f
g
Running the script and inspecting the result:
$ ruby b.rb
$ cat numbers.txt
1 - hello
3 -
2 - goodbye
5 - test2
6 - a
4 - test1
7 - b
8 - c
8 - d
8 - e
8 - f
8 - g
The way this works is that:
for line in open(XXX) will lazily iterate over the lines of the file you specify.
fork will spawn a child process executing the given block; in this case, we use backticks to run a command in the shell. Note that rand returns a float between 0 and 1 here, so we sleep for less than a second, and line.chomp removes the trailing newline from line.
If we've accumulated 8 or more processes, call wait to stop everything until one of them returns.
Finally, outside the loop, call waitall to join all remaining processes before exiting the script.
Here's Mark's solution wrapped up as a ProcessPool class; it might be helpful to have around (and please correct me if I made a mistake):
class ProcessPool
  def initialize pool_size
    @pool_size  = pool_size
    @free_slots = @pool_size
  end

  def fork &p
    if @free_slots == 0
      Process.wait
      @free_slots += 1
    end
    @free_slots -= 1
    puts "Free slots: #{@free_slots}"
    Process.fork &p
  end

  def waitall
    Process.waitall
  end
end

pool = ProcessPool.new 8

for line in open('files.txt') do
  pool.fork { Kernel.sleep rand(10); puts line.chomp }
end

pool.waitall
puts 'finished'
The standard library documentation for Queue has this producer/consumer example:
require 'thread'

queue = Queue.new

producer = Thread.new do
  5.times do |i|
    sleep rand(i) # simulate expense
    queue << i
    puts "#{i} produced"
  end
end

consumer = Thread.new do
  5.times do |i|
    value = queue.pop
    sleep rand(i / 2) # simulate expense
    puts "consumed #{value}"
  end
end

consumer.join
I do find it a little verbose though.
Wikipedia describes this as a thread pool pattern
arr = IO.readlines("filename")
(Note that this reads the whole file into memory at once, which the question explicitly wanted to avoid.)
