I come here because I have an issue with my Jupyter Python 3 notebook.
I need to create a function that uses the multiprocessing library.
Before implementing it, I am making some tests.
I found a lot of different examples, but the issue is always the same: my code is executed but nothing shows up in the notebook's interface.
The code I try to run in Jupyter is this one:
import os
from multiprocessing import Process, current_process


def doubler(number):
    """
    A doubling function that can be used by a process
    """
    result = number * 2
    proc_name = current_process().name
    print('{0} doubled to {1} by: {2}'.format(
        number, result, proc_name))
    return result


if __name__ == '__main__':
    numbers = [5, 10, 15, 20, 25]
    procs = []
    proc = Process(target=doubler, args=(5,))

    for index, number in enumerate(numbers):
        proc = Process(target=doubler, args=(number,))
        proc2 = Process(target=doubler, args=(number,))
        procs.append(proc)
        procs.append(proc2)
        proc.start()
        proc2.start()

    proc = Process(target=doubler, name='Test', args=(2,))
    proc.start()
    procs.append(proc)

    for proc in procs:
        proc.join()
It's OK when I run my code without Jupyter, with the command "python my_program.py", and I can see the logs.
Is there, for my example and in Jupyter, a way to catch the results of my two tasks (proc and proc2, which both call the function "doubler") in a variable/object that I could use afterwards?
If yes, how can I do it?
@Konate's answer really helped me. Here is a simplified version using multiprocessing.Pool:
import multiprocessing


def double(a):
    return a * 2


def driver_func():
    PROCESSES = 4
    with multiprocessing.Pool(PROCESSES) as pool:
        params = [(1, ), (2, ), (3, ), (4, )]
        results = [pool.apply_async(double, p) for p in params]
        for r in results:
            print('\t', r.get())


driver_func()
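As a side note of my own (not part of the original answer), the same values can be collected even more compactly with pool.map, which blocks until all workers finish and returns the results as an ordered list:

import multiprocessing


def double(a):
    return a * 2


def driver_func():
    with multiprocessing.Pool(4) as pool:
        # map() distributes the inputs over the workers and
        # returns the results to the parent process in input order
        results = pool.map(double, [1, 2, 3, 4])
    print(results)  # [2, 4, 6, 8]


driver_func()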
I succeeded by using multiprocessing.Pool.
I was inspired by this approach:
def test():
    PROCESSES = 4
    print('Creating pool with %d processes\n' % PROCESSES)

    with multiprocessing.Pool(PROCESSES) as pool:
        TASKS = [(mul, (i, 7)) for i in range(10)] + \
                [(plus, (i, 8)) for i in range(10)]

        results = [pool.apply_async(calculate, t) for t in TASKS]
        imap_it = pool.imap(calculatestar, TASKS)
        imap_unordered_it = pool.imap_unordered(calculatestar, TASKS)

        print('Ordered results using pool.apply_async():')
        for r in results:
            print('\t', r.get())
        print()

        print('Ordered results using pool.imap():')
        for x in imap_it:
            print('\t', x)
...etc
For more, the full code is at: https://docs.python.org/3.4/library/multiprocessing.html
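The snippet above depends on helper functions (mul, plus, calculate, calculatestar) that are defined in the linked documentation example; a simplified, self-contained sketch of those helpers (my paraphrase, so the snippet can actually run) could look like this:

def mul(a, b):
    return a * b


def plus(a, b):
    return a + b


def calculate(func, args):
    # apply a worker function to its argument tuple
    return func(*args)


def calculatestar(args):
    # unpack a (func, args) task, as expected by imap/imap_unordered
    return calculate(*args)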
Another way of running multiprocessing jobs in a Jupyter notebook is to use one of the approaches supported by the nbmultitask package.
This works for me on macOS (I cannot make it work on Windows):
import multiprocessing as mp

mp_start_count = 0

if __name__ == '__main__':
    if mp_start_count == 0:
        mp.set_start_method('fork')
        mp_start_count += 1
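As a variation of my own (not from the original answer), the manual counter can be replaced by asking multiprocessing whether a start method has already been fixed; get_start_method(allow_none=True) returns None until one has been chosen, so re-running the cell does not raise a RuntimeError:

import multiprocessing as mp

if __name__ == '__main__':
    # only set the start method if it has not been chosen yet
    if mp.get_start_method(allow_none=True) is None:
        mp.set_start_method('fork')  # 'fork' is only available on Unix-like systems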
Related
I'm self-studying Python and this is my first code.
I'm working on analyzing logs from servers. Usually I need to analyze a full day of logs. I created a script (this is an example with simplified logic) just to check speed. With straightforward sequential code, analyzing 20 million rows takes about 12-13 minutes. I need 200 million rows in 5 minutes.
What I tried:
Use multiprocessing (I met an issue with shared memory, and I think I fixed it). But the result is 300K rows = 20 sec, no matter how many processes. (PS: I also need to control the number of processes in advance.)
Use threading (I found that it doesn't give any speedup: 300K rows = 2 sec, but the normal code is the same, 300K = 2 sec).
Use asyncio (I think the script is slow because it needs to read many files). The result is the same as threading: 300K = 2 sec.
Finally, I think that all three of my scripts are incorrect and don't work correctly.
PS: I try to avoid using specific Python modules (like pandas) because in that case it will be more difficult to run on different servers. Better to use the common libraries.
Please help to check the first one: multiprocessing.
import csv
import os
from multiprocessing import Process, Queue, Value, Manager

file = {"hcs.log", "hcs1.log", "hcs2.log", "hcs3.log"}


def argument(m, a, n):
    proc_num = os.getpid()
    a_temp_m = a["vod_miss"]
    a_temp_h = a["vod_hit"]
    with open(os.getcwd() + '/' + m, newline='') as hcs_1:
        hcs_2 = csv.reader(hcs_1, delimiter=' ')
        for j in hcs_2:
            if j[3].find('MISS') != -1:
                a_temp_m[n] = a_temp_m[n] + 1
            elif j[3].find('HIT') != -1:
                a_temp_h[n] = a_temp_h[n] + 1
    a["vod_miss"][n] = a_temp_m[n]
    a["vod_hit"][n] = a_temp_h[n]


if __name__ == '__main__':
    procs = []
    manager = Manager()
    vod_live_cuts = manager.dict()
    i = "vod_hit"
    ii = "vod_miss"
    cpu = 1
    n = 1
    vod_live_cuts[i] = manager.list([0] * cpu)
    vod_live_cuts[ii] = manager.list([0] * cpu)

    for m in file:
        proc = Process(target=argument, args=(m, vod_live_cuts, (n-1)))
        procs.append(proc)
        proc.start()
        if n >= cpu:
            n = 1
            proc.join()
        else:
            n += 1

    [proc.join() for proc in procs]
    [proc.close() for proc in procs]
I expect each file to be processed via def argument by an independent process, and finally all results to be saved in the dict vod_live_cuts. For each process I added an independent list in the dict. I think it will help avoid cross-process interference when using this parameter. But maybe it's the wrong way :(
Using IPC is costly, so only use "shared objects" for saving the final result, not for intermediate results while parsing the file.
Limiting the number of processes is done by using a multiprocessing.Pool; the following code uses it to reach the maximum hard-disk speed, so you only need to post-process the results.
You can only parse data as fast as your HDD can read it (typically 30-80 MB/s), so if you need to improve the performance further you should use an SSD or RAID 0 for higher disk speed; you cannot get much faster than this without changing your hardware.
import csv
import os
from multiprocessing import Process, Queue, Value, Manager, Pool

file = {"hcs.log", "hcs1.log", "hcs2.log", "hcs3.log"}


def argument(m, a):
    proc_num = os.getpid()
    a_temp_m_n = 0  # make it local to process
    a_temp_h_n = 0  # as shared lists use IPC
    with open(os.getcwd() + '/' + m, newline='') as hcs_1:
        hcs_2 = csv.reader(hcs_1, delimiter=' ')
        for j in hcs_2:
            if j[3].find('MISS') != -1:
                a_temp_m_n = a_temp_m_n + 1
            elif j[3].find('HIT') != -1:
                a_temp_h_n = a_temp_h_n + 1
    a["vod_miss"].append(a_temp_m_n)
    a["vod_hit"].append(a_temp_h_n)


if __name__ == '__main__':
    manager = Manager()
    vod_live_cuts = manager.dict()
    i = "vod_hit"
    ii = "vod_miss"
    cpu = 1
    vod_live_cuts[i] = manager.list()
    vod_live_cuts[ii] = manager.list()

    with Pool(cpu) as pool:
        tasks = []
        for m in file:
            task = pool.apply_async(argument, args=(m, vod_live_cuts))
            tasks.append(task)
        for task in tasks:
            task.get()

    print(list(vod_live_cuts[i]))
    print(list(vod_live_cuts[ii]))
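Since the answer leaves the post-processing to the reader, here is a minimal sketch of that final aggregation (my own addition, assuming the manager lists from the code above are still in scope at the end of the __main__ block):

# sum the per-file counters collected by the workers
total_hits = sum(list(vod_live_cuts["vod_hit"]))
total_misses = sum(list(vod_live_cuts["vod_miss"]))
print('HIT:', total_hits, 'MISS:', total_misses)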
I have two functions which are time-consuming to run. I want to run each using the multiprocessing library in Python. I have seen some examples, but I don't know how, once the processes have finished their calculations, to retrieve the output of each and sum up the total results. Each function returns a value.
For example:
from multiprocessing import Pool, Process
import time


def f1(n):
    time.sleep(0.5)
    global f1out
    f1out = n**2
    return n**2


def f2(n):
    time.sleep(0.5)
    global f2out
    f2out = n**2
    return n**3


start = time.time()
if __name__ == "__main__":
    results1 = Process(target=f1, args=(range(10)))
    results2 = Process(target=f2, args=(range(10)))
    results1.start()
    results2.start()
    results1.join()
    results2.join()
    print(results1)
    print(results2)
end = time.time()
TT = end - start
print(TT)
I want to calculate (results1 + results2),
but results1 and results2 are not values!
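A minimal sketch of one way to do this (my own suggestion, reusing the apply_async pattern from the earlier answers, since a Process object never carries its target's return value):

from multiprocessing import Pool
import time


def f1(n):
    time.sleep(0.5)
    return n**2


def f2(n):
    time.sleep(0.5)
    return n**3


if __name__ == "__main__":
    start = time.time()
    with Pool(2) as pool:
        # run both functions in parallel and keep handles to their results
        r1 = pool.apply_async(f1, (3,))
        r2 = pool.apply_async(f2, (3,))
        total = r1.get() + r2.get()  # get() blocks until each result is ready
    print(total)  # 9 + 27 = 36
    print(time.time() - start)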
I am having a similar problem to the one in this link: Parallel error with GridSearchCV, works fine with other methods
I tried both of the solutions and neither one worked for me either.
When n_jobs = -1 in a grid search I get an error, although n_jobs = -1 works fine for single models.
I tried updating sklearn but it did not help.
Here is the code I am trying:
rf = RandomForestClassifier()
rf_random = RandomizedSearchCV(estimator=rf,
                               param_distributions=random_grid,
                               n_iter=100, cv=3, verbose=2, random_state=42,
                               n_jobs=-1)
rf_random.fit(X_train, y_train)
I am getting this error:
ModuleNotFoundError: No module named
'sklearn.externals.joblib.externals.loky.backend.popen_loky_win32'
I tried the solution from the link but got the same error:
def randomsearcher():
    clf = ensemble.RandomForestClassifier()
    param_grid = random_grid
    grid_s = model_selection.GridSearchCV(clf, cv=5, param_grid=param_grid,
                                          n_jobs=-1, verbose=1)
    grid_s.fit(X_train, y_train)
    return grid_s


if __name__ == '__main__':
    randomsearcher()
No problem with this code:
knn = KNeighborsClassifier(n_neighbors=50,
                           weights='distance', algorithm='auto', n_jobs=-1)
I am working on a VM that has 8 virtual processors on 2 sockets.
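Not a confirmed fix, but since the missing module lives under sklearn.externals.joblib, one assumption worth checking is a version mismatch between scikit-learn and the standalone joblib/loky packages; printing the installed versions is a harmless first diagnostic:

# check whether scikit-learn and the standalone joblib are out of sync
import sklearn
import joblib

print("scikit-learn:", sklearn.__version__)
print("joblib:", joblib.__version__)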
I want to solve a set of self-contained problems in parallel, after which additional information is added to solve a new problem.
Below is an example of the structure of the program used to solve the problem:
from z3 import *
import concurrent.futures


# solver test function
def add(a, b, solver, index):
    solver.append(a > b)
    assert solver.check()
    model = solver.model()
    return {
        'solver': solver,
        'av': model[a],
        'a': a,
        'b': b,
        'bv': model[b],
        'index': index
    }


with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # start solving the problems
    to_compute = []
    for i in range(3):
        sol = z3.Solver()
        to_compute.append(executor.submit(
            add,
            Int('a{}'.format(i)),
            Int('b{}'.format(i)),
            sol,
            i
        ))

    # wait for the solution to the computations
    next_to_solve = []
    for result_futures in concurrent.futures.as_completed(to_compute):
        results = result_futures.result()
        print(results)
        sol = results['solver']
        sol.append(results['a'] > results['av'])
        next_to_solve.append(
            executor.submit(
                add,
                results['a'],
                results['b'],
                sol,
                results['index']
            )
        )
The results are different each time the program is run; the outcomes include:
Z3Exception 'invalid dec_ref command'
Python crashing
No error
What do I need to do to make the program more reliable?
Did you see this example: http://github.com/Z3Prover/z3/blob/master/examples/python/parallel.py
I'm not an expert on the concurrent features in z3py, but it seems to me that you need to be very careful about creating the variables in the same context that you're running the solvers in. There are some hints in that very file.
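To illustrate that point about contexts (my own sketch, not taken from the linked example), each thread can be given its own z3.Context, and both the solver and every variable it touches must be created in that same context:

from z3 import Int, Solver, Context, sat
import concurrent.futures


def solve_in_own_context(i):
    # a private context per thread; the solver and the variables
    # it reasons about all live in this context
    ctx = Context()
    a = Int('a{}'.format(i), ctx)
    b = Int('b{}'.format(i), ctx)
    solver = Solver(ctx=ctx)
    solver.add(a > b)
    assert solver.check() == sat
    return solver.model()


with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    for model in executor.map(solve_in_own_context, range(3)):
        print(model)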
Using the Mac OS X API, I'm trying to save a PDF file with a Quartz filter applied, just like it is possible from the "Save As" dialog in the Preview application. So far I've written the following code (using Python and PyObjC, but that isn't important for me):
-- filter-pdf.py: begin
from Foundation import *
from Quartz import *
import objc

page_rect = CGRectMake(0, 0, 612, 792)

fdict = NSDictionary.dictionaryWithContentsOfFile_("/System/Library/Filters/Blue Tone.qfilter")

in_pdf = CGPDFDocumentCreateWithProvider(CGDataProviderCreateWithFilename("test.pdf"))

url = CFURLCreateWithFileSystemPath(None, "test_out.pdf", kCFURLPOSIXPathStyle, False)
c = CGPDFContextCreateWithURL(url, page_rect, fdict)

np = CGPDFDocumentGetNumberOfPages(in_pdf)
for ip in range(1, np + 1):
    page = CGPDFDocumentGetPage(in_pdf, ip)
    r = CGPDFPageGetBoxRect(page, kCGPDFMediaBox)
    CGContextBeginPage(c, r)
    CGContextDrawPDFPage(c, page)
    CGContextEndPage(c)
-- filter-pdf.py: end
Unfortunately, the filter "Blue Tone" isn't applied; the output PDF looks exactly like the input PDF.
Question: what did I miss? How do I apply a filter?
Well, the documentation doesn't promise that this way of creating and using "fdict" should cause the filter to be applied. But I just rewrote (as far as I could) the sample code /Developer/Examples/Quartz/Python/filter-pdf.py, which was distributed with older versions of Mac OS (meanwhile, this code doesn't work either):
----- filter-pdf-old.py: begin
from CoreGraphics import *
import sys, os, math, getopt, string


def usage():
    print '''
usage: python filter-pdf.py FILTER INPUT-PDF OUTPUT-PDF

Apply a ColorSync Filter to a PDF document.
'''


def main():
    page_rect = CGRectMake(0, 0, 612, 792)
    try:
        opts, args = getopt.getopt(sys.argv[1:], '', [])
    except getopt.GetoptError:
        usage()
        sys.exit(1)

    if len(args) != 3:
        usage()
        sys.exit(1)

    filter = CGContextFilterCreateDictionary(args[0])
    if not filter:
        print 'Unable to create context filter'
        sys.exit(1)

    pdf = CGPDFDocumentCreateWithProvider(CGDataProviderCreateWithFilename(args[1]))
    if not pdf:
        print 'Unable to open input file'
        sys.exit(1)

    c = CGPDFContextCreateWithFilename(args[2], page_rect, filter)
    if not c:
        print 'Unable to create output context'
        sys.exit(1)

    for p in range(1, pdf.getNumberOfPages() + 1):
        # r = pdf.getMediaBox(p)
        r = pdf.getPage(p).getBoxRect(p)
        c.beginPage(r)
        c.drawPDFDocument(r, pdf, p)
        c.endPage()
    c.finish()


if __name__ == '__main__':
    main()
----- filter-pdf-old.py: end
=======================================================================
The working code based on the answer:
from Foundation import *
from Quartz import *
pdf_url = NSURL.fileURLWithPath_("test.pdf")
pdf_doc = PDFDocument.alloc().initWithURL_(pdf_url)
furl = NSURL.fileURLWithPath_("/System/Library/Filters/Blue Tone.qfilter")
fobj = QuartzFilter.quartzFilterWithURL_(furl)
fdict = { 'QuartzFilter': fobj }
pdf_doc.writeToFile_withOptions_("test_out.pdf", fdict)
Two approaches: if you need to open and modify an already existing file, use PDFKit's PDFDocument (reference) and call PDFDocument's writeToFile_withOptions_ with an options dict that includes the "QuartzFilter" entry set to the needed filter.
OTOH, if you need to do your own drawing and have a CGContext at hand, you can use something along these lines:
from Quartz import *
data = NSMutableData.dataWithCapacity_(1024**2)
dataConsumer = CGDataConsumerCreateWithCFData(data)
context = CGPDFContextCreate(dataConsumer, None, None)
f = QuartzFilter.quartzFilterWithURL_(NSURL.fileURLWithPath_("YourFltr.qfilter"))
f.applyToContext_(context)
# do your drawing
CGPDFContextClose(context)
# the PDF is in the data variable. Do whatever you need to do with the data (save to file...).
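As a small follow-up of my own (assuming the NSMutableData buffer from the snippet above), the in-memory PDF can be written to disk with NSData's writeToFile:atomically: method:

# persist the generated PDF bytes to a file
# ("filtered_out.pdf" is just an example output path)
data.writeToFile_atomically_("filtered_out.pdf", True)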