I have this code:
require 'pp'
$pool = []

def work arg
  $pool.push($$)
  sleep(1 + rand(5))
  $pool.delete($$)
  exit
end

ary = []
100.times { |x| ary.push(x) }

while ary.any? do
  while $pool.size < 10 && ary.any?
    arg = ary.pop
    Process.detach( fork { work(arg) } )
    pp $pool.size
    sleep 0.01
  end
end
What I don't expect: even though work() pushes onto $pool, pp $pool.size in the inner while loop always prints 0.
Where am I wrong?
When you create a new process with fork, the new process gets a copy of the (empty) $pool. The $pool in the parent process, which executes the loop, is never populated with anything (the call to work is executed in the child process), so its size is always 0.
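One way to get the effect you want (a sketch of my own, not from the question): keep the bookkeeping in the parent, tracking child PIDs and reaping finished children non-blockingly with Process.wait and WNOHANG:
require 'pp'

# Sketch: the parent tracks its children itself, because a global updated
# inside the forked children never reaches the parent's copy of memory.
pool = []                # child PIDs, lives only in the parent
ary  = (0...100).to_a

until ary.empty?
  # reap children that have already exited (non-blocking)
  pool.reject! { |pid| Process.wait(pid, Process::WNOHANG) }

  while pool.size < 10 && ary.any?
    ary.pop                               # take the next piece of work
    pool << fork { sleep(1 + rand(5)) }   # child does its "work" and exits
  end

  pp pool.size
  sleep 0.01
end

pool.each { |pid| Process.wait(pid) }     # wait for the final batch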
I have a 50+ GB XML file which I initially tried to (man)handle with Nokogiri :)
Got killed: 9 - obviously :)
Now I'm into muddy Ruby threaded waters with this stab (at it):
#!/usr/bin/env ruby

def add_vehicle index, str
  IO.write "ess_#{index}.xml", str
  #file_name = "ess_#{index}.xml"
  #fd = File.new file_name, "w"
  #fd.write str
  #fd.close
  #puts file_name
end

begin
  record = []
  threads = []
  counter = 1
  file = File.new("../ess2.xml", "r")
  while (line = file.gets)
    case line
    when /<ns:Statistik/
      record = []
      record << line
    when /<\/ns:Statistik/
      record << line
      puts "file - %s" % counter
      threads << Thread.new { add_vehicle counter, record.join }
      counter += 1
    else
      record << line
    end
  end
  file.close
  threads.each { |thr| thr.join }
rescue => err
  puts "Exception: #{err}"
  err
end
Somehow this code 'skips' one or two files when writing the result files - hmmm!?
Okay, you have a problem because your file is huge, and you want to use multithreading.
Now you have two problems.
On a more serious note, I've had very good experience with this code.
It parsed 20 GB XML files with almost no memory use.
Download the mentioned code, save it as xml_parser.rb, and this script should work:
require_relative 'xml_parser.rb'

file = "../ess2.xml"

def add_vehicle index, str
  filename = "ess_#{index}.xml"
  File.open(filename, 'w+') { |out| out.puts str }
  puts format("%s has been written with %d lines", filename, str.each_line.count)
end

i = 0
Xml::Parser.new(Nokogiri::XML::Reader(open(file))) do
  for_element 'ns:Statistik' do
    i += 1
    add_vehicle(i, @node.outer_xml)
  end
end
#=> ess_1.xml has been written with 102 lines
#=> ess_2.xml has been written with 102 lines
#=> ...
It will take time, but it should work without error and without using much memory.
By the way, here is the reason why your code missed some files:
threads = []
counter = 1
threads << Thread.new { puts counter }
counter += 1
threads.each { |thr| thr.join }
#=> 2
threads = []
counter = 1
threads << Thread.new { puts counter }
sleep(1)
counter += 1
threads.each { |thr| thr.join }
#=> 1
counter += 1 was faster than the add_vehicle call, so your add_vehicle was often called with the wrong counter. With many millions of nodes, some might get a 0 offset and some a 1 offset. When two add_vehicle calls get the same id, they overwrite each other's file, and a file goes missing.
You have the same problem with record, with lines getting written in the wrong file.
Perhaps you should try to synchronize counter += 1 with a Mutex.
For example:
@lock = Mutex.new
@counter = 0

def add_vehicle str
  @lock.synchronize do
    @counter += 1
    IO.write "ess_#{@counter}.xml", str
  end
end
Mutex implements a simple semaphore that can be used to coordinate access to shared data from multiple concurrent threads.
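Another way to avoid the race, not from the answer above but a common idiom: pass the values into the thread block as arguments, so each thread captures its own copies of counter and the joined record and no lock is needed:
# Sketch: Thread.new forwards its arguments to the block, so each thread
# sees the values of counter and record as they were when it was created.
threads << Thread.new(counter, record.join) do |index, str|
  add_vehicle index, str
end
counter += 1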
Or you can go another way from the start and use Ox. It is way faster than Nokogiri; take a look at a comparison. For huge files, use Ox::Sax.
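For illustration only, a minimal Ox::Sax sketch (my assumption, untested against your file) that streams the document and merely counts ns:Statistik elements, to show the shape of the API:
require 'ox'

# Sketch: Ox::Sax delivers callbacks as the document streams past,
# so memory use stays flat even for very large files.
class StatistikCounter < Ox::Sax
  attr_reader :count

  def initialize
    @count = 0
  end

  def start_element(name) # element names arrive as Symbols
    @count += 1 if name == :'ns:Statistik'
  end
end

handler = StatistikCounter.new
File.open('../ess2.xml') { |io| Ox.sax_parse(handler, io) }
puts handler.count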
I have a Worker and Job example, where each Job has an expensive/slow perform method.
If I have 10 Jobs in my @job_table I'd like to work them off in batches of 5, each within their own process.
After the 5 processes (one batch) have exited I'm trying to remove those Jobs from the @job_table with delete_at.
I'm observing something unexpected in my implementation (see code below) though:
jobs:
[#<Job:0x007fd2230082a8 @id=0>,
 #<Job:0x007fd223008280 @id=1>,
 #<Job:0x007fd223008258 @id=2>,
 #<Job:0x007fd223008208 @id=3>,
 #<Job:0x007fd2230081e0 @id=4>,
 #<Job:0x007fd2230081b8 @id=5>,
 #<Job:0x007fd223008190 @id=6>,
 #<Job:0x007fd223008168 @id=7>,
 #<Job:0x007fd223008140 @id=8>,
 #<Job:0x007fd223008118 @id=9>]
This is the @job_table before the first batch is run. I see that Jobs 0-4 have run and exited successfully (omitted output here).
So I'm calling remove_batch_1 and would expect jobs 0-4 to be removed from the @job_table, but this is what I'm observing instead:
jobs:
[#<Job:0x007fd223008280 @id=1>,
 #<Job:0x007fd223008208 @id=3>,
 #<Job:0x007fd2230081b8 @id=5>,
 #<Job:0x007fd223008168 @id=7>,
 #<Job:0x007fd223008118 @id=9>]
I've logged the i parameter in the method and it takes the values 0-4. But it looks like delete_at is removing different jobs (0, 2, 4, 6, 8).
I also wrote another method for removing a batch, remove_batch_0, which uses slice! and behaves as expected.
BATCH_SIZE = 5 || ENV['BATCH_SIZE']

class Job
  def initialize(id)
    @id = id
  end

  def perform
    puts "Job #{@id}> Start!"
    sleep 1
    puts "Job #{@id}> End!"
  end
end

class Worker
  def initialize
    @job_table = []
    fill_job_table
    work_job_table
  end

  def fill_job_table
    10.times do |i|
      @job_table << Job.new(i)
    end
  end

  def work_job_table
    until @job_table.empty?
      puts "jobs: "
      pp @job_table
      work_batch
      Process.waitall
      remove_batch_1
    end
  end

  def work_batch
    i = 0
    while (i < @job_table.length && i < BATCH_SIZE)
      fork { @job_table[i].perform }
      i += 1
    end
  end

  def remove_batch_1
    i = 0
    while (i < @job_table.length && i < BATCH_SIZE)
      @job_table.delete_at(i)
      i += 1
    end
  end

  def remove_batch_0
    @job_table.slice!(0..BATCH_SIZE-1)
  end
end
Worker.new
You use delete_at in a while loop. Let's see what happens:
Imagine you have an array [0,1,2,3,4,5] and you call:
(0..2).each { |i| array.delete_at(i) }
In the first iteration you delete the first element, so the array looks like this afterwards: [1,2,3,4,5]. In the next iteration you delete the second element, which leads to [1,3,4,5]. Then you delete the third: [1,3,5].
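A quick confirmation of those steps:
array = [0, 1, 2, 3, 4, 5]
(0..2).each { |i| array.delete_at(i) }
p array #=> [1, 3, 5]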
You might want to use Array#shift instead:
def remove_batch_1
  @job_table.shift(BATCH_SIZE)
end
I have the following code to block until all threads have finished (Gist):
ThreadsWait.all_waits(*threads)
What's the simplest way to set a timeout here, i.e. kill any threads if they are still running after e.g. 3 seconds?
Thread#join accepts a timeout argument in seconds, after which it stops waiting. Try this, for example:
5.times.map do |i|
  Thread.new do
    1_000_000_000.times { |i| i } # takes more than a second
    puts "Finished" # will never print
  end
end.each { |t| t.join(1) } # times out after a second

p 'stuff I want to execute after finishing the threads' # will print
If you have some things you want to execute before joining, you can do:
5.times.map do |i|
  Thread.new do
    1_000_000_000.times { |i| i } # takes more than a second
    puts "Finished" # will never print
  end
end.each do |thread|
  puts 'Stuff I want to do before join' # Will print, multiple times
  thread.join(1)
end
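If you also want to kill the threads that did not finish in time (as the question asks), a small sketch on top of the answer above: join returns nil when the timeout expires, so for your threads array you could do:
threads.each do |thread|
  thread.kill unless thread.join(3) # join(3) returns nil if the 3-second timeout expires
end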
This question comes from this code snippet:
lambda do
  $SAFE = 2
  puts $SAFE
end.call

puts $SAFE
The result is:
2
0
$SAFE is a global variable, so I can't understand this. I explored it for a while, and then found $SAFE is a thread-local variable, not a real global.
OK, I can understand this:
k = Thread.new do
  $SAFE = 2
  puts $SAFE
end

k.run
1000000.times {}
puts $SAFE
But wait, does the block open another thread to run in?
No, blocks (procs, lambdas) do not run in their own threads. The issue here is that Ruby saves and restores the $SAFE level around each and every method (and proc) call. If you try this with another variable, like $FOO, you get the expected results:
> x = ->{ $FOO = 1; puts $FOO }.call; puts $FOO
1
1
You can see where this is implemented in rb_method_call in proc.c:
const int safe_level_to_run = 4 /*SAFE_LEVEL_MAX*/;
safe = rb_safe_level();
if (rb_safe_level() < safe_level_to_run) {
    rb_set_safe_level_force(safe_level_to_run);
}
// ...
// Invoke the block
// ...
if (safe >= 0)
    rb_set_safe_level_force(safe);
The safe level is saved, and if it's less than 4, it's set to 4. The block is then called, and if the safe level before modification was >= 0, it's restored to what it was before. You can see this in action with something like the following:
> puts $SAFE; ->{ puts $SAFE; $SAFE = 1; puts $SAFE }.call; puts $SAFE
0
0
1
0
$SAFE is 0 heading into the block, and the block is executed, and then it's restored to 0 as the block exits, despite being modified to be 1 inside the block.
I have a large file (hundreds of megs) that consists of filenames, one per line.
I need to loop through the list of filenames, and fork off a process for each filename. I want a maximum of 8 forked processes at a time and I don't want to read the whole filename list into RAM at once.
I'm not even sure where to begin, can anyone help me out?
File.foreach("large_file").each_slice(8) do |eight_lines|
  # eight_lines is an array containing 8 lines.
  # at this point you can iterate over these filenames
  # and spawn off your processes/threads
end
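A sketch of how that body could look, assuming a hypothetical process_file command that handles one filename; each batch of eight children is waited on before the next eight lines are read:
# Sketch: fork one child per filename in the slice, wait for the whole batch.
# "process_file" is a placeholder for whatever work you need per file.
File.foreach("large_file").each_slice(8) do |eight_lines|
  pids = eight_lines.map do |filename|
    fork { system("process_file", filename.chomp) }
  end
  pids.each { |pid| Process.wait(pid) }
end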
It sounds like the Process module will be useful for this task. Here's something I quickly threw together as a starting point:
include Process

i = 0
for line in open('files.txt') do
  i += 1
  fork { `sleep #{rand} && echo "#{i} - #{line.chomp}" >> numbers.txt` }
  if i >= 8
    wait # join any single child process
    i -= 1
  end
end
waitall # join all remaining child processes
Output:
$ cat files.txt
hello
goodbye
test1
test2
a
b
c
d
e
f
g
$ ruby b.rb
$ cat numbers.txt
1 - hello
3 -
2 - goodbye
5 - test2
6 - a
4 - test1
7 - b
8 - c
8 - d
8 - e
8 - f
8 - g
The way this works is that:
for line in open(XXX) will lazily iterate over the lines of the file you specify.
fork will spawn a child process executing the given block; in this case, we use backticks to run a shell command. Note that rand with no argument returns a float between 0 and 1, so we sleep for less than a second, and line.chomp removes the trailing newline that we get from line.
If we've accumulated 8 or more processes, call wait to stop everything until one of them returns.
Finally, outside the loop, call waitall to join all remaining processes before exiting the script.
Here's Mark's solution wrapped up as a ProcessPool class; it might be helpful to have it around (and please correct me if I made a mistake):
class ProcessPool
  def initialize pool_size
    @pool_size = pool_size
    @free_slots = @pool_size
  end

  def fork &p
    if @free_slots == 0
      Process.wait
      @free_slots += 1
    end
    @free_slots -= 1
    puts "Free slots: #{@free_slots}"
    Process.fork &p
  end

  def waitall
    Process.waitall
  end
end

pool = ProcessPool.new 8

for line in open('files.txt') do
  pool.fork { Kernel.sleep rand(10); puts line.chomp }
end

pool.waitall
puts 'finished'
The standard library documentation for Queue has this example:
require 'thread'

queue = Queue.new

producer = Thread.new do
  5.times do |i|
    sleep rand(i) # simulate expense
    queue << i
    puts "#{i} produced"
  end
end

consumer = Thread.new do
  5.times do |i|
    value = queue.pop
    sleep rand(i/2) # simulate expense
    puts "consumed #{value}"
  end
end

consumer.join
I do find it a little verbose though.
Wikipedia describes this as the thread pool pattern.
arr = IO.readlines("filename")
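A minimal thread-pool sketch along those lines (my own, assuming arr holds the lines read above): a fixed number of workers pull items from a Queue and stop when they see a marker:
require 'thread'

arr   = IO.readlines("filename")
queue = Queue.new
arr.each { |line| queue << line }

workers = 4.times.map do
  Thread.new do
    # Each worker pops until it sees the :stop marker pushed below.
    while (line = queue.pop) != :stop
      puts line.chomp # placeholder for the real per-line work
    end
  end
end

workers.size.times { queue << :stop }
workers.each(&:join)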