Reading a file N lines at a time in ruby - ruby

I have a large file (hundreds of megs) that consists of filenames, one per line.
I need to loop through the list of filenames, and fork off a process for each filename. I want a maximum of 8 forked processes at a time and I don't want to read the whole filename list into RAM at once.
I'm not even sure where to begin, can anyone help me out?

File.foreach("large_file").each_slice(8) do |eight_lines|
# eight_lines is an array containing 8 lines.
# at this point you can iterate over these filenames
# and spawn off your processes/threads
end
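To make the batching concrete, here is a minimal sketch that forks one child per filename in each slice and waits for the whole slice before reading the next one. The method name run_in_batches and the placeholder child body are assumptions for illustration, not part of the answer above, and fork is not available on Windows.

```ruby
# Hypothetical helper: forks one child per line, batch_size lines at a time.
# Returns the number of batches processed.
def run_in_batches(path, batch_size = 8)
  batches = 0
  File.foreach(path).each_slice(batch_size) do |lines|
    pids = lines.map do |line|
      filename = line.chomp
      fork do
        # Placeholder for the real per-file work (assumption).
        exit!(filename.empty? ? 1 : 0)
      end
    end
    pids.each { |pid| Process.wait(pid) } # wait for the whole batch
    batches += 1
  end
  batches
end
```

One trade-off of this approach: a slow file holds up its whole batch, so fewer than 8 children may be running towards the end of each slice.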

It sounds like the Process module will be useful for this task. Here's something I quickly threw together as a starting point:
include Process
i = 0
for line in open('files.txt') do
i += 1
fork { `sleep #{rand} && echo "#{i} - #{line.chomp}" >> numbers.txt` }
if i >= 8
wait # join any single child process
i -= 1
end
end
waitall # join all remaining child processes
Output:
hello
goodbye
test1
test2
a
b
c
d
e
f
g
$ ruby b.rb
$ cat numbers.txt
1 - hello
3 -
2 - goodbye
5 - test2
6 - a
4 - test1
7 - b
8 - c
8 - d
8 - e
8 - f
8 - g
The way this works is that:
for line in open(XXX) will lazily iterate over the lines of the file you specify.
fork will spawn a child process executing the given block, and in this case, we use backticks to indicate something to be executed by the shell. Note that rand returns a value 0-1 here so we are sleeping less than a second, and I call line.chomp to remove the trailing newline that we get from line.
If we've accumulated 8 or more processes, call wait to stop everything until one of them returns.
Finally, outside the loop, call waitall to join all remaining processes before exiting the script.
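In the sample output above, i serves both as the label and as the running-process count, which is why later lines are all labelled 8. A possible variation, sketched here under the assumption that you want the real line number, keeps the two apart. The name fork_each_line and the returned peak count are hypothetical additions for illustration.

```ruby
# Hypothetical helper: runs the block in a forked child for every line,
# keeping at most max_children alive. Returns the peak number of children.
def fork_each_line(path, max_children = 8)
  running = 0
  peak = 0
  File.foreach(path).with_index(1) do |line, lineno|
    if running >= max_children
      Process.wait   # block until any one child exits
      running -= 1
    end
    fork { yield lineno, line.chomp }
    running += 1
    peak = running if running > peak
  end
  Process.waitall    # join the remaining children
  peak
end
```

The block receives the line number and the chomped line, so the child can do whatever per-file work is needed.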

Here's Mark's solution wrapped up as a ProcessPool class, might be helpful to have it around (and please correct me if I made some mistake):
class ProcessPool
  def initialize pool_size
    @pool_size = pool_size
    @free_slots = @pool_size
  end
  def fork &p
    if @free_slots == 0
      Process.wait
      @free_slots += 1
    end
    @free_slots -= 1
    puts "Free slots: #{@free_slots}"
    Process.fork &p
  end
  def waitall
    Process.waitall
  end
end
pool = ProcessPool.new 8
for line in open('files.txt') do
pool.fork { Kernel.sleep rand(10); puts line.chomp }
end
pool.waitall
puts 'finished'

The standard library documentation for Queue has
require 'thread'
queue = Queue.new
producer = Thread.new do
5.times do |i|
sleep rand(i) # simulate expense
queue << i
puts "#{i} produced"
end
end
consumer = Thread.new do
5.times do |i|
value = queue.pop
sleep rand(i/2) # simulate expense
puts "consumed #{value}"
end
end
consumer.join
I do find it a little verbose though.
Wikipedia describes this as a thread pool pattern.
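Applied to the filename list, the same producer/consumer idea might look like the sketch below. It uses a SizedQueue so the reader blocks when the workers fall behind, which keeps only a handful of lines in memory. The helper name run_pool, the :done stop marker and the queue size are assumptions made for this example.

```ruby
# Hypothetical helper: feeds lines to a fixed number of worker threads through
# a bounded queue, so only a few lines are held in memory at once.
def run_pool(lines, workers: 8)
  queue = SizedQueue.new(workers * 2) # producer blocks when the queue is full
  results = Queue.new
  threads = Array.new(workers) do
    Thread.new do
      while (name = queue.pop) != :done
        results << yield(name)
      end
    end
  end
  lines.each { |line| queue << line.chomp }
  workers.times { queue << :done }      # one stop marker per worker
  threads.each(&:join)
  Array.new(results.size) { results.pop }
end
```

It could be called as run_pool(File.foreach('files.txt')) { |name| system('some_command', name) }, where some_command stands in for whatever external program does the work. Results come back in completion order, not file order.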

arr = IO.readlines("filename")

Related

why does Ruby Thread act up in this example - effectively missing files

I have a 50+ GB XML file which I initially tried to (man)handle with Nokogiri :)
Got killed: 9 - obviously :)
Now I'm into muddy Ruby threaded waters with this stab (at it):
#!/usr/bin/env ruby
def add_vehicle index, str
IO.write "ess_#{index}.xml", str
#file_name = "ess_#{index}.xml"
#fd = File.new file_name, "w"
#fd.write str
#fd.close
#puts file_name
end
begin
record = []
threads = []
counter = 1
file = File.new("../ess2.xml", "r")
while (line = file.gets)
case line
when /<ns:Statistik/
record = []
record << line
when /<\/ns:Statistik/
record << line
puts "file - %s" % counter
threads << Thread.new { add_vehicle counter, record.join }
counter += 1
else
record << line
end
end
file.close
threads.each { |thr| thr.join }
rescue => err
puts "Exception: #{err}"
err
end
Somehow this code 'skips' one or two files when writing the result files - hmmm!?
Okay, you have a problem because your file is huge, and you want to use multithreading.
Now have you. problemstwo
On a more serious note, I've had very good experience with this code.
It parsed 20GB xml files with almost no memory use.
Download the mentioned code, save it as xml_parser.rb and this script should work :
require_relative 'xml_parser.rb'
file = "../ess2.xml"
def add_vehicle index, str
filename = "ess_#{index}.xml"
File.open(filename,'w+'){|out| out.puts str}
puts format("%s has been written with %d lines", filename, str.each_line.count)
end
i=0
Xml::Parser.new(Nokogiri::XML::Reader(open(file))) do
for_element 'ns:Statistik' do
i+=1
add_vehicle(i,@node.outer_xml)
end
end
#=> ess_1.xml has been written with 102 lines
#=> ess_2.xml has been written with 102 lines
#=> ...
It will take time, but it should work without error and without using much memory.
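If pulling in an extra parser is not an option, the splitting can also be done line by line without any threads. This is only a rough sketch: the helper name each_record is hypothetical, and it assumes, as the code in the question does, that the opening and closing ns:Statistik tags each sit on their own line.

```ruby
# Hypothetical helper: yields each <ns:Statistik>...</ns:Statistik> block
# as one string. Lines outside a record are skipped.
def each_record(io)
  record = nil
  io.each_line do |line|
    record = [] if line =~ /<ns:Statistik/
    next unless record           # not inside a record yet
    record << line
    if line =~ %r{</ns:Statistik}
      yield record.join
      record = nil               # wait for the next opening tag
    end
  end
end
```

With a File opened on ../ess2.xml as io, the block can increment an index and write each string to its own ess_N.xml file. Only one record is held in memory at a time.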
By the way, here is the reason why your code missed some files :
threads = []
counter = 1
threads << Thread.new { puts counter }
counter += 1
threads.each { |thr| thr.join }
#=> 2
threads = []
counter = 1
threads << Thread.new { puts counter }
sleep(1)
counter += 1
threads.each { |thr| thr.join }
#=> 1
counter += 1 ran before the thread got around to calling add_vehicle, so add_vehicle was often called with the wrong counter. With many millions of nodes, some calls get an offset of 0 and some an offset of 1. When two add_vehicle calls receive the same id, they overwrite each other, and a file goes missing.
You have the same problem with record, with lines getting written in the wrong file.
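One way to avoid both problems while keeping the threads is to hand the current values to Thread.new as arguments, so each thread gets its own snapshot. The sketch below demonstrates the idea with a hypothetical helper, snapshot_threads, that only records the index each thread saw.

```ruby
# Hypothetical demo: every thread receives the counter value that was current
# when the thread was created, even though counter keeps changing afterwards.
def snapshot_threads(count)
  seen = Queue.new
  counter = 1
  threads = []
  count.times do
    # Arguments to Thread.new are evaluated here and passed to the block.
    threads << Thread.new(counter) { |index| seen << index }
    counter += 1
  end
  threads.each(&:join)
  Array.new(seen.size) { seen.pop }.sort
end
```

In the original script the equivalent change would be Thread.new(counter, record.join) { |index, xml| add_vehicle index, xml }.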
Perhaps you should try to synchronize counter += 1 with a Mutex.
For example:
@lock = Mutex.new
@counter = 0
def add_vehicle str
  @lock.synchronize do
    @counter += 1
    IO.write "ess_#{@counter}.xml", str
  end
end
Mutex implements a simple semaphore that can be used to coordinate access to shared data from multiple concurrent threads.
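The same locking idea can be wrapped in a small object, which avoids relying on top-level instance variables. This is only a sketch; the class name IndexAllocator is made up for the example.

```ruby
# Hypothetical class: hands out unique, increasing indexes to concurrent threads.
class IndexAllocator
  def initialize
    @lock = Mutex.new
    @counter = 0
  end

  # Increments the counter under the lock and returns the new value.
  def next_index
    @lock.synchronize { @counter += 1 }
  end
end
```

Each thread would then call next_index once and use the returned number for its file name, so no two threads can end up with the same index.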
Or you can go another way from the start and use Ox. It is way faster than Nokogiri; take a look at a comparison. For huge files, Ox::Sax

Is it possible to call multiple methods or objects at once?

Say I have a method that includes a counter that outputs its count to the screen on every tick.
Elsewhere in the program, a new version of this method is called, so they both/all run at once, have different counters, and update together with the tick. Is it possible to do this with Ruby? Normally creating another instance of an object is what I would do, I am still new to Ruby though and getting the hang of it.
I will edit with sample code of what I am trying to achieve later. I'm currently on a mobile without access to a computer.
Here I'm creating two instances of a Counter, both counters are initially set to 0. Then I launch them 3 seconds apart - each in its own thread. They start to print out numbers.
class Counter
  def initialize
    @counter = 0 # initialize counter to 0
  end
  def run
    loop do
      # wait one second, print the counter and increase it
      sleep 1
      puts @counter
      @counter += 1
    end
  end
end
threads = []
2.times do
# put each counter in a separate thread
threads << Thread.new do
counter = Counter.new
counter.run
end
sleep 3 # make a pause between launching counters
end
threads.each(&:join)
Output I get:
0 # first
1 # first
2 # first
0 # second
3 # first
1 # second
4 # first
2 # second
5 # first
The only trick here is to use the Thread class; otherwise the second counter would never start, since the first counter's loop would block the whole process.
You could use a queue and an external loop, something like:
class Counter
  def initialize(start)
    @count = start
  end
  def tick
    @count += 1
    puts @count
  end
end
queue = []
queue << Counter.new(0)
queue << Counter.new(100)
5.times do |i|
puts "--- tick #{i} ---"
queue.each(&:tick)
sleep 1
end
Output:
--- tick 0 ---
1
101
--- tick 1 ---
2
102
--- tick 2 ---
3
103
--- tick 3 ---
4
104
--- tick 4 ---
5
105
Within the 5.times loop, tick is sent to each item in the queue. Note that the methods are called in the order the counters were added to the queue, i.e. they are not called simultaneously.
For your purpose you could use an event loop, processes, or threads, because in the common case Ruby is blocked while a method is executing (until the method returns control).
class ThreadCounter
  def run
    @thread ||= Thread.new do
      i = 0
      while !@stop do
        puts i+=1
        sleep(1)
      end
      @stop = nil
    end
  end
  def stop
    @stop = true
    @thread && @thread.join
  end
end
counter1 = ThreadCounter.new
counter2 = ThreadCounter.new
counter1.run
counter2.run
# wait some time
counter1.stop
counter2.stop

Ruby Variable Reference Issue

I am not fluent in ruby and am having trouble with the following code example. I want to pass the array index to the thread function. When I run this code, all threads print "4". They should instead print "0 1 2 3 4" (in any order).
It seems that the num variable is being shared between all iterations of the loop and passes a reference to the "test" function. The loop finishes before the threads start and num is left equal to 4.
What is going on and how do I get the correct behavior?
NUM_THREADS = 5
def test(num)
puts num.to_s()
end
threads = Array.new(NUM_THREADS)
for i in 0..(NUM_THREADS - 1)
num = i
threads[i] = Thread.new{test(num)}
end
for i in 0..(NUM_THREADS - 1)
threads[i].join
end
Your script does what I would expect in Unix but not in Windows, most likely because the thread instantiation is competing with the for loop for using the num value. I think the reason is that the for loop does not create a closure, so after finishing that loop num is equal to 4:
for i in 0..4
end
puts i
# => 4
To fix it (and write more idiomatic Ruby), you could write something like this:
NUM_THREADS = 5
def test(num)
puts num # to_s is unnecessary
end
# Create an array for each thread that runs test on each index
threads = NUM_THREADS.times.map { |i| Thread.new { test i } }
# Call the join method on each thread
threads.each(&:join)
where i would be local to the map block.
"What is going on?" => The scope of num is the main environment, so it is shared by all threads (The only thing surrounding it is the for keyword, which does not create a scope). The execution of puts in all threads was later than the for loop on i incrementing it to 4. A variable passed to a thread as an argument (such as num below) becomes a block argument, and will not be shared outside of the thread.
NUM_THREADS = 5
threads = Array.new(NUM_THREADS){|i| Thread.new(i){|num| puts num}}.each(&:join)
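The scoping difference can be seen without any threads at all. In this small sketch (the method names are made up for the example), lambdas created inside a for loop all share one variable, while lambdas created inside a block each get their own block parameter.

```ruby
# for does not open a new scope: every lambda closes over the same i.
def captured_by_for
  procs = []
  for i in 0..2
    procs << -> { i }
  end
  procs.map(&:call)
end

# A block parameter is fresh on every iteration: each lambda keeps its own i.
def captured_by_block
  (0..2).map { |i| -> { i } }.map(&:call)
end

p captured_by_for   # prints [2, 2, 2]
p captured_by_block # prints [0, 1, 2]
```

Threads behave the same way as the lambdas here, except that the moment at which they read the shared variable depends on scheduling.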

Multithreading calculations in ruby

I want to create a script to calculate numbers in multiple threads. Each thread will calculate the powers of 2 but the first thread must start calculating from 2, the second from 4, and the third from 8, printing some text in-between.
Example:
Im a thread and these are my results
2
4
8
Im a thread and these are my results
4
8
16
Im a thread and these are my results
8
16
32
My fail code:
def loopa(s)
3.times do
puts s
s=s**2
end
end
threads=[]
num=2
until num == 8 do
threads << Thread.new{ loopa(num) }
num=num**2
end
threads.each { |x| puts "Im a thread and these are my results" ; x.join }
My fail results:
Im a thread and these are my results
8
64
4096
8
64
4096
8
64
4096
Im a thread and these are my results
Im a thread and these are my results
I suggest you read the "Threads and Processes" chapter of the Pragmatic Programmers' Ruby book. Here's an old version online. The section called "Creating Ruby Threads" is especially relevant to your question.
To fix the problem, you need to change your Thread.new line to this:
threads << Thread.new(num){|n| loopa(n) }
Your version doesn't work because num is shared between threads, and may be changed by another thread. By passing the variable via a block, the block variable is no longer shared.
More Info
Also, there's an error in your math.
Output values will be:
Thread 1: 2 4 16
Thread 2: 4 16 256
Thread 3: 6 36 1296
"8" is never reached because the until condition quits as soon as it sees "8".
If you want clearer output, use this as the body of loopa:
3.times do
print "#{Thread.current}: #{s}\n"
s=s**2
end
This lets you distinguish the 3 threads. Note that it's better to use a print command with a newline-terminated string versus using puts without a newline, because the latter prints the newline as a separate instruction, which may be interrupted by another thread.
It's normal. Read what you write. First you start 3 threads that run asynchronously, so their output can interleave in various combinations. Only then do you print 'Im a thread and these are my results' and join each thread. Also remember that the block closes over the variable num itself, not a copy of its value, so if you change num after creating a thread, every thread sees the change. To avoid it write:
threads = (1..3).map do |i|
puts "I'm starting thread no #{i}"
Thread.new { loopa(2**i) }
end
I feel the need to post a mathematically correct version:
def loopa(s)
3.times do
print "#{Thread.current}: #{s}\n"
s *= 2
end
end
threads=[]
num=2
while num <= 8 do
threads << Thread.new(num){|n| loopa(n) }
num *= 2
end
threads.each { |x| print "Im a thread and these are my results\n" ; x.join }
Bonus 1: threadless solution (naive)
power = 1
workers = 3
iterations = 3
(power ... power + workers).each do |pow|
worker_pow = 2 ** pow
puts "I'm a worker and these are my results"
iterations.times do |inum|
puts worker_pow
worker_pow *= 2
end
end
Bonus 2: threadless solution (cached)
power = 1
workers = 3
iterations = 3
cache_size = workers + iterations - 1
# generate all the values upfront
cache = []
(power ... power+cache_size).each do |i|
cache << 2**i
end
workers.times do |wnum|
puts "I'm a worker and these are my results"
# use a sliding-window to grab the part of the cache we want
puts cache[wnum,3]
end

Threads in Ruby

Why does this code work (I see the output 1 2 3):
for i in 1..3
Thread.new{
puts i
}
end
However, the following code does not produce the same output (I do not see the output 1 2 3)?
for i in 1..3
Thread.new{
sleep(5)
puts i
}
end
When you hit the end of the script, Ruby exits. If you add sleep 10 after the final loop, you can see the output show up. (Albeit, as 3 each time, because the binding to i reflects the value at the end of processing, and the sleep causes a thread switch back to the loop.)
You might want something like:
threads = []
for i in 1..3
threads << Thread.new {
sleep 5
puts i
}
end
threads.map {|t| t.join }
That will wait for all the threads to terminate before exiting.
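If the threads compute something instead of printing it, Thread#value can stand in for join: it waits for the thread and then returns the result of its block. A minimal sketch, with a hypothetical helper name, which also passes i as a thread argument so each thread sees its own number:

```ruby
# Hypothetical helper: doubles each number in its own thread.
def doubled_in_threads(range)
  threads = range.map { |i| Thread.new(i) { |n| n * 2 } }
  threads.map(&:value) # value joins the thread, then returns the block's result
end

p doubled_in_threads(1..3) # prints [2, 4, 6]
```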
