ipython notebook : how to parallelize external script - parallel-processing

I'm trying to use parallel computing from ipython parallel library. But I have little knowledge about it and I find the doc difficult to read from someone who knows nothing about parallel computing.
Funnily, all tutorials I found just re-use the example in the doc, with the same explanation, which from my point of view, is useless.
Basically what I'd like to do is running few scripts in background so they are executed in the same time. In bash it would be something like :
for my_file in $(cat list_file); do
python pgm.py my_file &
done
But bash interpreter of Ipython notebook doesn't handle the background mode.
It seems that solution was to use parallel library from ipython.
I tried :
from IPython.parallel import Client
rc = Client()
rc.block = True
dview = rc[:2] # I take only 2 engines
But then I'm stuck. I don't know how to run twice (or more) the same script or pgm at the same time.
Thanks.

One year later, I eventually managed to get what I wanted.
1) Create a function with what you want to do on the different cpu. Here it is just calling a script from the bash with the ! magic ipython command. I guess it would work with the call() function.
def my_func(my_file):
!python pgm.py {my_file}
Don't forget the {} when using !
Note also that the path to my_file should be absolute, since the clusters are where you started the notebook (when doing jupyter notebook or ipython notebook) which is not necessarily where you are.
2) Start your ipython notebook Cluster with the number of CPU you want.
Wait 2s and execute the following cell:
from IPython import parallel
rc = parallel.Client()
view = rc.load_balanced_view()
3) Get a list of file you want to process:
files = list_of_files
4) Map asynchronously your function with all your files to the view of your engines you just created. (not sure of the wording).
r = view.map_async(my_func, files)
While it's running you can do something else on the notebook (It runs in "background"!). You can also call r.wait_interactive() that enumerates interactively the number of files processed and the number of time spent so far and the number of files left. This will prevent you to run other cells (but you can interrupt it).
And if you have more files than engines, no worries, they will be processed as soon as an engine finishes with 1 file.
Hope this will help others !
This tutorial might be of some help:
http://nbviewer.ipython.org/github/minrk/IPython-parallel-tutorial/blob/master/Index.ipynb
Note also that I still have IPython 2.3.1, I don't know if it changed since Jupyter.
Edit: Still works with Jupyter, see here for difference and potential issues you may encounter
Note that if you use external libraries in your function, you need to import them on the different engines with:
%px import numpy as np
or
%%px
import numpy as np
import pandas as pd
Same with variable and other functions, you need to push them to the engine name space:
rc[:].push(dict(
foo=foo,
bar=bar))

If you're trying to executing some external scripts in parallel, you don't need to use IPython's parallel functionality. Replicating bash's parallel execution can be achieved with the subprocess module as follows:
import subprocess
procs = []
for i in range(10):
procs.append(subprocess.Popen(['ls', '/Users/shad/tmp/'], stdout=subprocess.PIPE))
results = []
for proc in procs:
stdout, _ = proc.communicate()
results.append(stdout)
Be wary that if your subprocess generates a lot of output, the process will block. If you print the output (results) you get:
print results
['file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n']

Related

python3: Run C file and python script in parallel

Ive got a snipped of my actual python script written in this post. Basically I want to have a C programm and a Pyserial function executed in parallel (the C programm is for controlling a motor, the pySerial is for communicating with a arduino). My programm will be executed on a RPi3b using Spyder3 and Rasbipian.
What Ive already figured out from the sources below is that if you want to have a terminal program executed in python you should use the subprocess class. If you want to execute something in parallel the Process package from multiprocessing will do the job.
So Ive mixed them together and tried to archive my goals by using the code bleow. Unluckly without any success. The p1 process immediatly starts after the p1 process is called [ p1 = Process(target=run_c_file()) ] and the script stops until the C file has finished. Does anyone out there can help? Thank you very much!
BTW Im using python 3.5...
My sources:
https://docs.python.org/3.5/library/multiprocessing.html , https://docs.python.org/3.5/library/subprocess.html?highlight=subprocess
import serial_comm as ssf #My own function. Tested and working when single calling
import subprocess as sub
from multiprocessing import Process
def run_c_file():
sub.run("./C_File") #Call the C File in the same directory. Immeidatly starts when script is at line 14 -> p1 = Process(target=run_c_file())
def run_pyserial(ser_obj):
ssf.command(ser_obj,"Command") #Tell the arduino to do something fancy (tested and working)
ser_obj = ssf.connect()
p1 = Process(target=run_c_file())
p2 = Process(target=run_pyserial(ser_obj))
try:
p1.start()
p2.start()
p1.join() #Process one should start here (as far as I understood)
p2.join() #Process two should start here (as far as I understood)
'''The following part is still in progress'''
except KeyboardInterrupt:
print("Aborting")
p1.terminate()
p2.terminate()
try
p1 = Process(target=run_c_file)
p2 = Process(target=run_pyserial, args=(ser_obj,))
currently you are calling the function instead of passing it.

How to clear cache (or force recompilation) in numba

I have a fairly large codebase written in numba, and I have noticed that when the cache is enabled for a function calling another numba compiled function in another file, changes in the called function are not picked up when the called function is changed. The situation occurs when I have two files:
testfile2:
import numba
#numba.njit(cache=True)
def function1(x):
return x * 10
testfile:
import numba
from tests import file1
#numba.njit(cache=True)
def function2(x, y):
return y + file1.function1(x)
If in a jupyter notebook, I run the following:
# INSIDE JUPYTER NOTEBOOK
import sys
sys.path.insert(1, "path/to/files/")
from tests import testfile
testfile.function2(3, 4)
>>> 34 # good value
However, if I change then change testfile2 to the following:
import numba
#numba.njit(cache=True)
def function1(x):
return x * 1
Then I restart the jupyter notebook kernel and rerun the notebook, I get the following
import sys
sys.path.insert(1, "path/to/files/")
from tests import testfile
testfile.function2(3, 4)
>>> 34 # bad value, should be 7
Importing both files into the notebook has no effect on the bad result. Also, setting cache=False only on function1 also has no effect. What does work is setting cache=False on all njit'ted functions, then restarting the kernel, then rerunning.
I believe that LLVM is probably inlining some of the called functions and then never checking them again.
I looked in the source and discovered there is a method that returns the cache object numba.caching.NullCache(), instantiated a cache object and ran the following:
cache = numba.caching.NullCache()
cache.flush()
Unfortunately that appears to have no effect.
Is there a numba environment setting, or another way I can manually clear all cached functions within a conda env? Or am I simply doing something wrong?
I am running numba 0.33 with Anaconda Python 3.6 on Mac OS X 10.12.3.
I "solved" this with a hack solution after seeing Josh's answer, by creating a utility in the project method to kill off the cache.
There is probably a better way, but this works. I'm leaving the question open in case someone has a less hacky way of doing this.
import os
def kill_files(folder):
for the_file in os.listdir(folder):
file_path = os.path.join(folder, the_file)
try:
if os.path.isfile(file_path):
os.unlink(file_path)
except Exception as e:
print("failed on filepath: %s" % file_path)
def kill_numba_cache():
root_folder = os.path.realpath(__file__ + "/../../")
for root, dirnames, filenames in os.walk(root_folder):
for dirname in dirnames:
if dirname == "__pycache__":
try:
kill_files(root + "/" + dirname)
except Exception as e:
print("failed on %s", root)
This is a bit of a hack, but it's something I've used before. If you put this function in the top-level of where your numba functions are (for this example, in testfile), it should recompile everything:
import inspect
import sys
def recompile_nb_code():
this_module = sys.modules[__name__]
module_members = inspect.getmembers(this_module)
for member_name, member in module_members:
if hasattr(member, 'recompile') and hasattr(member, 'inspect_llvm'):
member.recompile()
and then call it from your jupyter notebook when you want to force a recompile. The caveat is that it only works on files in the module where this function is located and their dependencies. There might be another way to generalize it.

why pool.map in python doesn't work

import multiprocessing as mul
def f(x):
return x**2
pool = mul.Pool(5)
rel = pool.map(f,[1,2,3,4,5,6,7,8,9,10])
print(rel)
When I run the program above, the application is stuck in a loop and can't stop.
I am using python 3.5 in windows, is there something wrong?
This is what I see on my screen:
I am new to finance data analysis; and I am trying to find out a way to solve the big data problem with parallel computing.
Its not working because you are typing the commands in a shell; try saving the code in a file and running it directly.
Don't forget to copy the code correctly, you were missing a very important if statement (see the documentation).
Save this to a file, for example example.py on the desktop:
import multiprocessing as mul
def f(x):
return x**2
if __name__ == '__main__':
pool = mul.Pool(5)
rel = pool.map(f,[1,2,3,4,5,6,7,8,9,10])
print(rel)
Then, open a command prompt and type:
python %USERPROFILE%\Desktop\example.py

Python multiprocessing stdin input

All code written and tested on python 3.4 windows 7.
I was designing a console app and had a need to use stdin from command-line (win os) to issue commands and to change the operating mode of the program. The program depends on multiprocessing to deal with cpu bound loads to spread to multiple processors.
I am using stdout to monitor that status and some basic return information and stdin to issue commands to load different sub-processes based on the returned console information.
This is where I found a problem. I could no get the multiprocessing module to accept stdin inputs but stdout was working just fine. I think found the following help on stack So I tested it and found that with the threading module this all works great, except for the fact that all output to stdout is paused until each time stdin is cycled due to GIL lock with stdin blocking.
I will say I have been successful with a work around implemented with msvcrt.kbhit(). However, I can't help but wonder if there is some sort of bug in the multiprocessing feature that is making stdin not read any data. I tried numerous ways and nothing worked when using multiprocessing. Even attempted to use Queues, but I did not try pools, or any other methods from multiprocessing.
I also did not try this on my linux machine since I was focusing on trying to get it to work.
Here is simplified test code that does not function as intended (reminder this was written in Python 3.4 - win7):
import sys
import time
from multiprocessing import Process
def function1():
while True:
print("Function 1")
time.sleep(1.33)
def function2():
while True:
print("Function 2")
c = sys.stdin.read(1) # Does not appear to be waiting for read before continuing loop.
sys.stdout.write(c) #nothing in 'c'
sys.stdout.write(".") #checking to see if it works at all.
print(str(c)) #trying something else, still nothing in 'c'
time.sleep(1.66)
if __name__ == "__main__":
p1 = Process(target=function1)
p2 = Process(target=function2)
p1.start()
p2.start()
Hopefully someone can shed light on whether this is intended functionality, if I didn't implement it correctly, or some other useful bit of information.
Thanks.
When you take a look at Pythons implementation of multiprocessing.Process._bootstrap() you will see this:
if sys.stdin is not None:
try:
sys.stdin.close()
sys.stdin = open(os.devnull)
except (OSError, ValueError):
pass
You can also confirm this by using:
>>> import sys
>>> import multiprocessing
>>> def func():
... print(sys.stdin)
...
>>> p = multiprocessing.Process(target=func)
>>> p.start()
>>> <_io.TextIOWrapper name='/dev/null' mode='r' encoding='UTF-8'>
And reading from os.devnull immediately returns empty result:
>>> import os
>>> f = open(os.devnull)
>>> f.read(1)
''
You can work this around by using open(0):
file is either a string or bytes object giving the pathname (absolute or relative to the current working directory) of the file to be opened or an integer file descriptor of the file to be wrapped. (If a file descriptor is given, it is closed when the returned I/O object is closed, unless closefd is set to False.)
And "0 file descriptor":
File descriptors are small integers corresponding to a file that has been opened by the current process. For example, standard input is usually file descriptor 0, standard output is 1, and standard error is 2:
>>> def func():
... sys.stdin = open(0)
... print(sys.stdin)
... c = sys.stdin.read(1)
... print('Got', c)
...
>>> multiprocessing.Process(target=func).start()
>>> <_io.TextIOWrapper name=0 mode='r' encoding='UTF-8'>
Got a

Uncaught Throw generated by JLink or UseFrontEnd

This example routine generates two Throw::nocatch warning messages in the kernel window. Can they be handled somehow?
The example consists of this code in a file "test.m" created in C:\Temp:
Needs["JLink`"];
$FrontEndLaunchCommand = "Mathematica.exe";
UseFrontEnd[NotebookWrite[CreateDocument[], "Testing"]];
Then these commands pasted and run at the Windows Command Prompt:
PATH = C:\Program Files\Wolfram Research\Mathematica\8.0\;%PATH%
start MathKernel -noprompt -initfile "C:\Temp\test.m"
Addendum
The reason for using UseFrontEnd as opposed to UsingFrontEnd is that an interactive front end may be required to preserve output and messages from notebooks that are usually run interactively. For example, with C:\Temp\test.m modified like so:
Needs["JLink`"];
$FrontEndLaunchCommand="Mathematica.exe";
UseFrontEnd[
nb = NotebookOpen["C:\\Temp\\run.nb"];
SelectionMove[nb, Next, Cell];
SelectionEvaluate[nb];
];
Pause[10];
CloseFrontEnd[];
and a notebook C:\Temp\run.nb created with a single cell containing:
x1 = 0; While[x1 < 1000000,
If[Mod[x1, 100000] == 0,
Print["x1=" <> ToString[x1]]]; x1++];
NotebookSave[EvaluationNotebook[]];
NotebookClose[EvaluationNotebook[]];
this code, launched from a Windows Command Prompt, will run interactively and save its output. This is not possible to achieve using UsingFrontEnd or MathKernel -script "C:\Temp\test.m".
During the initialization, the kernel code is in a mode which prevents aborts.
Throw/Catch are implemented with Abort, therefore they do not work during initialization.
A simple example that shows the problem is to put this in your test.m file:
Catch[Throw[test]];
Similarly, functions like TimeConstrained, MemoryConstrained, Break, the Trace family, Abort and those that depend upon it (like certain data paclets) will have problems like this during initialization.
A possible solution to your problem might be to consider the -script option:
math.exe -script test.m
Also, note that in version 8 there is a documented function called UsingFrontEnd, which does what UseFrontEnd did, but is auto-configured, so this:
Needs["JLink`"];
UsingFrontEnd[NotebookWrite[CreateDocument[], "Testing"]];
should be all you need in your test.m file.
See also: Mathematica Scripts
Addendum
One possible solution to use the -script and UsingFrontEnd is to use the 'run.m script
included below. This does require setting up a 'Test' kernel in the kernel configuration options (basically a clone of the 'Local' kernel settings).
The script includes two utility functions, NotebookEvaluatingQ and NotebookPauseForEvaluation, which help the script to wait for the client notebook to finish evaluating before saving it. The upside of this approach is that all the evaluation control code is in the 'run.m' script, so the client notebook does not need to have a NotebookSave[EvaluationNotebook[]] statement at the end.
NotebookPauseForEvaluation[nb_] := Module[{},While[NotebookEvaluatingQ[nb],Pause[.25]]]
NotebookEvaluatingQ[nb_]:=Module[{},
SelectionMove[nb,All,Notebook];
Or##Map["Evaluating"/.#&,Developer`CellInformation[nb]]
]
UsingFrontEnd[
nb = NotebookOpen["c:\\users\\arnoudb\\run.nb"];
SetOptions[nb,Evaluator->"Test"];
SelectionMove[nb,All,Notebook];
SelectionEvaluate[nb];
NotebookPauseForEvaluation[nb];
NotebookSave[nb];
]
I hope this is useful in some way to you. It could use a few more improvements like resetting the notebook's kernel to its original and closing the notebook after saving it,
but this code should work for this particular purpose.
On a side note, I tried one other approach, using this:
UsingFrontEnd[ NotebookEvaluate[ "c:\\users\\arnoudb\\run.nb", InsertResults->True ] ]
But this is kicking the kernel terminal session into a dialog mode, which seems like a bug
to me (I'll check into this and get this reported if this is a valid issue).

Resources