Scaling a ruby script by launching multiple processes instead of using threads

Scaling a ruby script by launching multiple processes instead of using threads - ruby

I want to increase the throughput of a script which does net I/O (a scraper). Instead of making it multithreaded in ruby (I use the default 1.9.1 interpreter), I want to launch multiple processes. So, is there a system for doing this to where I can track when one finishes to re-launch it again so that I have X number running at any time. ALso some will run with different command args. I was thinking of writing a bash script but it sounds like a potentially bad idea if there already exists a method for doing something like this on linux.

I would recommend not forking but instead that you use EventMachine (and the excellent em-http-request if you're doing HTTP). Managing multiple processes can be a bit of a handful, even more so than handling multiple threads, but going down the evented path is, in comparison, much simpler. Since you want to do mostly network IO, which consist mostly of waiting, I think that an evented approach would scale as well, or better than forking or threading. And most importantly: it will require much less code, and it will be more readable.
Even if you decide on running separate processes for each task, EventMachine can help you write the code that manages the subprocesses using, for example, EventMachine.popen.
And finally, if you want to do it without EventMachine, read the docs for IO.popen, Open3.popen and Open4.popen. All do more or less the same thing but give you access to the stdin, stdout, stderr (Open3, Open4), and pid (Open4) of the subprocess.

You can try fork http://ruby-doc.org/core/classes/Process.html#M003148
You can get the PID in return and see if this process run again or not.
If you want manage IO concurrency. I suggest you to use EventMachine.

You can either
implement (or find an equivalent gem) a ThreadPool (ProcessPool, in your case), or
prepare an array of all, let's say 1000 tasks to be processed, split it into, say 10 chunks of 100 tasks (10 being the number of parallel processes you want to launch), and launch 10 processes, of which each process right away receives 100 tasks to process. That way you don't need to launch 1000 processes and control that not more than 10 of them work at the same time.

Related

Reading file in parallel from multiple processes

I'm running multiple processes in parallel and each of these processes read the same file in parallel. It looks like some of the processes see a corrupted version of the file if I increase the number of processes to > 15 or so. What is the recommended way of handling such a scenario?
More details:
The file being read in parallel is actually a perl script. The multiple jobs are python processes, and each of them launch this perl script independently with different input parameters. When the number of jobs is increased, some of these jobs give errors that the perl script has invalid syntax (which is not true). Hence, I suspect that some of these jobs read in corrupted versions of the perl script.
I'm running all of this on a 32core machine.

If any process is also writing to the file, then you need to enforce some synchronization, for example with a global named mutex.
If there is no asynchronous writing going on, I would not expect to see corruption during the reads. Are you opening the files with "r" access? If you're still encountering troubles, it might be worth experimenting with reducing read buffer size. Or call out to a native win32 API for the file access.
Good luck!

How to detect that foreground process is waiting for input in UNIX?

I have to create a script (ksh or perl) that starts certain number of parallel jobs (another scripts), each of them runs as a foreground process in a separate session. Plus I start monitoring job that has to determine if any of those scripts is expecting input from operator, and switch to the corresponding session if necessary.
My problem is that I have not found a good way to determine that process is expecting input. For the background process it's pretty easy: process state is "stopped" and this can be easily checked with 'ps' command. In case of foreground process this does not work.
So far I tried to attach to the process with dbx or truss to see if it's hanging on 'read', but this approach seems too heavyweight.
Could you suggest some better solution? Perl, shell, C, Java, etc. … is ok as long as it’s standard and does not require extra 3rd party or OS-specific stuff to install.
Thank you.

What you're asking isn't possible, at least not reliably. The process may be using select or other polling method rather than blocking on a read call. You can't know whether it's waiting for operator input or busy doing other stuff, and in general it could be both (doing stuff in the background while being responsive to operator input).
The normal way for a program to signal that it's waiting for operator input is to print a prompt. Thus you should consider a session to be active if it's displayed a prompt since the last time you fed it input.
If your programs don't behave this way, you'll need to find some other program-specific way to know that these processes are waiting for input.

Go program launching several processes

I'm playing with Go to understand its features and syntax. I've done a simple producer-consumer program with concurrent go functions, and a priority buffer in the middle. A single producer produces tasks with a certain priority and send them to a buffer using a channel. A set of consumers will ask for a task when idle, received it and consume it. The intermediate buffer will store a set of tasks in a priority queue buffer, so highest priority tasks are served first. The program also prints the Garbage collector activity (how many times it was invoked and how many time it took to collect the garbage).
I'm running it on a Raspberry Pi using Go 1.1.
The software seems to work fine but I found that at SO level, htop shows that there are 4 processes running, with the same memory use, and the sum of the CPU use is over 100% (the Raspberry Pi only has one core so I suppose it has something to do with threads/processes). Also the system load is about 7% of the CPU, I suppose because of a constant context switching at OS level. The GOMAXPROCS environment variable is set to 1 or unset.
Do you know why Go is using more than one OS process?
Code can be found here: http://pastebin.com/HJUq6sab
Thank you!
EDIT:
It seems that htop shows the lightweight processes of the system. Go programs run several of these lightweight processes (they are different from the goroutines threads) so using htop shows several processes, while psor top will show just one, as it should be.

Please try to kill all of the suspect processes and try again running it only once. Also, do not use go run, at least for now - it blurs the number of running processes at minimum.
I suspect the other instances are simply leftovers from your previous development attempts (probably invoked through go run and not properly [indirectly] killed on SIGINT [hypothesis only]), especially because there's a 1 hour "timeout" at the end of 'main' (instead of a proper synchronization or select{}). A Go binary can spawn new threads, but it should never create new processes unless explicitly asked for that. Which is not the case of your code - it doesn't even import "os/exec" or "syscall" in the first place.
If my guess about the combination of go run and using a long timeout is really the culprit, then there's possibly some difference in the RP kernel wrt what the dev guys are using for testing.

ruby: How do i get the number of subprocess(fork) running

I want to limit the subprocesses count to 3. Once it hits 3 i wait until one of the processes stops and then execute a new one. I'm using Kernel.fork to start the process.
How do i get the number of running subprocesses? or is there a better way to do this?

A good question, but I don't think there's such a method in Ruby, at least not in the standard library. There's lots of gems out there....
This problem though sounds like a job for the Mutex class. Look up the section Condition Variables here on how to use Ruby's mutexes.

I usually have a Queue of tasks to be done, and then have a couple of threads consuming tasks until they receive an item indicating the end of work. There's an example in "Programming Ruby" under the Thread library. (I'm not sure if I should copy and paste the example to Stack Overflow - sorry)

My solution was to use trap("CLD"), to trap SIGCLD whenever a child process ended and decrease the counter (a global variable) of processes running.

Send CTRL+C to subprocess tree on Windows

I would like to run arbitrary console-based sub-processes and manage them from a single master process. The console based sub-processes communicate via stdin, stdout and stderr, and if you run them in a genuine console they terminate cleanly when you press CTRL+C. Some of them may in fact be a tree of processes, such as a batch script that runs an executable which may in turn run another executable to do some work. I would like to redirect their standard I/O (for example, so that I can show their output in a GUI window) and in certain circumstances to send them a CTRL+C event so that they will give up and terminate cleanly.
The following two diagrams show first the normal structure - one master process has four worker sub-processes, and some of those workers have their own subprocesses; and then what should happen when one of the workers needs to be stopped - it and all of its children should get the CTRL+C event, but no other processes should receive the CTRL+C event.
(source: livejournal.com)
Additionally, I would much prefer that there are no extra windows visible to the user.
Here's what I've tried (note that I'm working in Python, but solutions for C would still be helpful):
Spawning an extra intermediate process with CREATE_NEW_CONSOLE, and then having it spawn the worker process. Then have it call GenerateConsoleCtrlEvent(CTRL_C_EVENT, 0) when we want to kill the worker. Unfortunately, CREATE_NEW_CONSOLE seems to prevent me from redirecting the standard I/O channels, so I'm left with no easy way to get the output back to the main program.
Spawning an extra intermediate process with CREATE_NEW_PROCESS_GROUP, and then having it spawn the worker process. Then have it call GenerateConsoleCtrlEvent(CTRL_C_EVENT, 0) when we want to kill the worker. Somehow, this manages to send the CTRL+C only to the master process, which is completely useless. On closer inspection, GenerateConsoleCtrlEvent says that CTRL+C cannot be sent to process groups.
Spawning the subprocess with CREATE_NEW_PROCESS_GROUP. Then call GenerateConsoleCtrlEvent(CTRL_BREAK_EVENT, pid) to kill the worker. This is not ideal, because CTRL+BREAK is less friendly than CTRL+C and will probably result in a messier termination. (E.g. if it's a Python process, no KeyboardInterrupt can be caught and no finally blocks run.)
Is there any good way to do what I want? I can see that I could theoretically build on the first attempt and find some other way to communicate between the processes, but I am worried it will turn out to be extremely awkward. Are there good examples of other programs that achieve the same effect? It seems so simple that it can't be all that uncommon a requirement.

I don't know about managing/redirecting stdin et. al., but for managing the subprocess tree
have you considered using the Windows Job Objects api?
There are several other questions about managing process trees (How do I automatically destroy child processes in Windows? Performing equivalent of “Kill Process Tree” in c++ on windows) and it looks like the cleanest method if you can use it.
Chapter 5 of Windows Via C/C++ by Jeffery Richter has a good discussion on using CreateJobObject and the related APIs.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio