I have a pandas DataFrame with about 45,000 rows similar to:
from numpy import random
from pandas import DataFrame
df = DataFrame(random.rand(45000, 200))
I am trying to break up all the rows into a multiprocessing Queue like this:
from multiprocessing import Queue
rows = [idx_and_row[1] for idx_and_row in df.iterrows()]
my_queue = Queue(maxsize=0)
for idx, r in enumerate(rows):
    # print(idx)
    my_queue.put(r)
But when I run it, only about 37,000 items get put into my_queue, and then the program raises the following error:
raise Full
queue.Full
What is happening and how can I fix it?
The multiprocessing.Queue is designed for inter-process communication. It is not intended for storing large amounts of data; for that purpose, I'd suggest using Redis or Memcached.
Usually, the maximum queue size is platform dependent, even if you set maxsize to 0, and there is no easy way to work around that.
It seems that on Windows the maximum number of objects in a multiprocessing.Queue is effectively unlimited, but on Linux and macOS the maximum size is 32767, which is 2^15 - 1; here is the significance of that number.
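A minimal sketch of the Redis approach, assuming a local Redis server, the redis-py package, and the df from the question (the list key name "rows" is arbitrary):
import pickle
import redis

r = redis.Redis()  # assumes a Redis server running on localhost:6379

# producer: push pickled rows onto a Redis list
for _, row in df.iterrows():
    r.rpush("rows", pickle.dumps(row))

# consumer (e.g. in a worker process): pop rows until the list is empty
while True:
    payload = r.lpop("rows")
    if payload is None:
        break
    row = pickle.loads(payload)
    # ... process row ...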
I solved the problem by making an empty Queue object and passing it to all the processes I wanted to share it with, plus one additional process. The additional process is responsible for filling the queue with 10,000 rows and checking it every few seconds to see whether the queue has been emptied. When it's empty, another 10,000 rows are added. This way, all 45,000 rows get processed.
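A minimal sketch of that feeder process, assuming the df from the question; the chunk size, the polling interval, and the worker side are simplified placeholders:
import time
from multiprocessing import Process, Queue

CHUNK = 10000

def feeder(q, rows):
    """Refill the queue in chunks so it never exceeds the platform limit."""
    for start in range(0, len(rows), CHUNK):
        for row in rows[start:start + CHUNK]:
            q.put(row)
        # wait until the workers have drained the queue before refilling
        # (Queue.empty() is approximate, but good enough for this polling)
        while not q.empty():
            time.sleep(2)

if __name__ == "__main__":
    rows = [row for _, row in df.iterrows()]
    q = Queue()
    Process(target=feeder, args=(q, rows)).start()
    # ... start the worker processes that q.get() rows here ...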
A simple dask cache example. The cache does not work as expected. Let's assume we have a list of data and a series of delayed functions; I expect that when a function encounters the same input, the results are cached/memoized according to the cachey score.
This example demonstrates that this is not the case.
import time
import dask
import numpy as np
from dask.cache import Cache
from dask.diagnostics import visualize
from dask.diagnostics import Profiler, ResourceProfiler, CacheProfiler

def slow_func(x):
    time.sleep(5)
    return x + 1

output = []
data = np.ones((100))
for x in data:
    a = dask.delayed(slow_func)(x)
    output.append(a)

total = dask.delayed(sum)(output)

cache = Cache(2e9)
cache.register()

with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof, CacheProfiler() as cprof:
    total.compute()

visualize([prof, rprof, cprof])
[CacheProfiler plot]
After the initial parallel execution of the function, I would expect the next call with the same value to use a cached version, but it obviously does not. dask_key_name is for designating the same output, but I want to assess this function for a variety of inputs and, when the same input appears again, use the cached version. We can tell very easily whether this is happening with this function because of the 5-second delay: once the first value is cached, the whole computation should take roughly 5 seconds. Instead, this example executes every single delayed call with the full 5-second delay. I am able to create a memoized version using the cachey library directly, but this should work using the dask.cache library.
In dask.delayed you may need to specify the pure=True keyword.
You can verify that this worked because all of your dask delayed values will have the same key.
You don't need to use Cache for this if they are all in the same dask.compute call.
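A minimal sketch of that change against the example above; only the delayed call is different:
# Mark the delayed function as pure so identical inputs hash to identical keys
output = []
for x in data:
    a = dask.delayed(slow_func, pure=True)(x)
    output.append(a)

# every call received the same input (1.0), so all delayed values share one key
print({d.key for d in output})   # expect a set containing a single key

total = dask.delayed(sum)(output)
total.compute()                  # slow_func now runs only once, ~5 s in total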
I have an EA set in place that loops over history trades and builds one large string with trade information. I then send this string every second from MT4 to the Python backend using a plain PUSH/PULL pattern.
For whatever reason, the data isn't received on the PULL side when the transferred string becomes too long. The backend PULL socket splits each string and processes it further.
Any chance that the PULL-side is too slow to grab and process all the data which then causes an overflow (so that a delay arises due to the processing part)?
Talking about data size, we are well below 5 kB per second.
This is the PULL socket, which manipulates the data after receiving it:
import zmq
from time import sleep

# zmq_socket is a PULL socket set up earlier, with a receive timeout
# so that recv_string() can raise zmq.error.Again
while True:
    # check 24/7 for available data on the pull socket
    try:
        msg = zmq_socket.recv_string()
        data = msg.split("|")
        print(data)
        # if data is available and msg is account info, handle as follows
        if data[0] == "account_info":
            [...]
    except zmq.error.Again:
        print("\nResource timeout.. please try again.")
        sleep(0.000001)
I am a bit curious now, since the PULL socket seems unable to process even a string containing 40 trades with their corresponding information over a single MT4 client - Python connection. I actually planned to set it up to handle more than 5,000 MT4 client - Python backend connections at once.
Q : Any chance that the pull side is too slow to grab and process all the data which then causes an overflow (so that a delay arises due to the processing part)?
Zero chance.
Sending 640 B each second is definitely no showstopper ( 5 kB per second is nowhere near a performance ceiling ... )
The posted problem formulation is otherwise undecidable.
Step 1) prove ( POSACK/NACK ) whether the PUSH side accepts the payload for sending error-free.
Step 2) prove the PULL side is not to be blamed - send [PUSH.send(640*chr(64+i)) for i in range( 10 )] via a python-to-python tcp://-transport-class solo-channel, crossing a host-to-host hop over at least your local physical network ( no VMCI/emulated vLAN, no other localhost colocation ); see the minimal sketch after these steps.
Step 3) if both steps above got POSACK-ed, your next chances are the ZeroMQ configuration space and/or an MT4-based PUSH-side incompatibility, most probably "hidden" inside a ( not mentioned ) third-party ZeroMQ wrapper being used, or first-party issues with string handling / processing ( which you must have already read about, as it has been observed and mentioned so many times in past posts about this trouble with well "hidden" MQL4 internal eco-system changes ).
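A minimal python-to-python sketch for Step 2, assuming pyzmq and a plain tcp:// transport ( the address and the port number are placeholders ):
import zmq

HOST = "192.0.2.10"   # placeholder: address of the PUSH-side host
PORT = 5555

def push_side():
    """Run on the sending host: PUSH ten 640 B payloads."""
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PUSH)
    sock.bind("tcp://*:%d" % PORT)
    for i in range(10):
        sock.send_string(640 * chr(64 + i))

def pull_side():
    """Run on the receiving host: PULL the payloads and verify their size."""
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PULL)
    sock.connect("tcp://%s:%d" % (HOST, PORT))
    for i in range(10):
        msg = sock.recv_string()
        print(len(msg), msg[:4])   # expect 640 and the repeated test character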
Anyway, stay tuned. ZeroMQ is a sure bet and a true workhorse for professional, low-latency designs in the distributed-systems domain.
I have recently started working with Python's multiprocessing library, and decided that the Pool() and apply_async() approach is the most suitable for my problem. The code is quite long, but for this question I've compressed everything that isn't related to the multiprocessing into functions.
Background information
Basically, my program is supposed to take some data structure and send it to another program that will process it and write the results to a txt file. I have several thousand of these structures (N*M), and there are big chunks (M) that are independent and can be processed in any order. I created a worker pool to process these M structures before retrieving the next chunk. In order to process one structure, a new thread has to be created for the external program to run. The time spent outside the external program during processing is less than 20%, so if I check Task Manager, I can see the external program running under Processes.
Actual problem
This works very well for a while, but after many processed structures (any number between 5000 and 20000) the external program suddenly stops showing up in the Task Manager, and the Python children run at their individual peak performance (~13% CPU) without producing any more results. I don't understand what the problem might be. There is plenty of RAM left, and each child only uses around 90 MB. It is also really weird that it works for quite some time and then stops. If I use Ctrl-C, it stops after a few minutes, so it is only semi-responsive to user input.
One thought I had was that when the timed-out external program thread is killed (which happens every now and then), maybe something isn't closed properly, so the child process ends up waiting for something it cannot find anymore? And if so, is there a better way of handling timed-out external processes?
from multiprocessing import Pool, TimeoutError

N = 500        # Number of chunks of data that can be multiprocessed
M = 80         # Independent chunk of data
timeout = 100  # Higher than any of the values for dataStructures.timeout

if __name__ == "__main__":
    results = [None]*M
    savedData = []

    with Pool(processes=4) as pool:
        for iteration in range(N):
            dataStructures = [generate_data_structure(i) for i in range(M)]
            #---Process data structures---
            for iS, dataStructure in enumerate(dataStructures):
                results[iS] = pool.apply_async(processing_func, (dataStructure,))
            #---Extract processed data---
            for iR, result in enumerate(results):
                try:
                    processedData = result.get(timeout=timeout)
                except TimeoutError:
                    print("Got TimeoutError.")
                if processedData.someBool:
                    savedData.append(processedData)
Here are also the functions that create the new thread for the external program:
import subprocess as sp
import win32api as wa
import threading

def processing_func(dataStructure):
    # Call another program that processes the data, and wait until it is finished/timed out
    timedOut = RunCmd(dataStructure.command).start_process(dataStructure.timeout)
    # Read the data from the other program, stored in a text file
    if not timedOut:
        processedData = extract_data_from_finished_thread()
    else:
        processedData = 0.
    return processedData

class RunCmd(threading.Thread):
    CREATE_NO_WINDOW = 0x08000000

    def __init__(self, cmd):
        threading.Thread.__init__(self)
        self.cmd = cmd
        self.p = None

    def run(self):
        self.p = sp.Popen(self.cmd, creationflags=self.CREATE_NO_WINDOW)
        self.p.wait()

    def start_process(self, timeout):
        self.start()
        self.join(timeout)
        timedOut = self.is_alive()
        # Kill the external process if the timeout limit is reached
        if timedOut:
            wa.TerminateProcess(self.p._handle, -1)
            self.join()
        return timedOut
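For reference, one alternative I am considering for the timeout handling (not tested inside the full pipeline) is to drop the wrapper thread entirely and let subprocess enforce the timeout itself; the function below is only a sketch of that idea:
import subprocess as sp

CREATE_NO_WINDOW = 0x08000000

def run_with_timeout(cmd, timeout):
    """Run the external program; kill it and report a timeout if it runs too long."""
    p = sp.Popen(cmd, creationflags=CREATE_NO_WINDOW)
    try:
        p.wait(timeout=timeout)
        return False            # finished in time
    except sp.TimeoutExpired:
        p.kill()                # terminate the child and reap it
        p.wait()
        return True             # timed out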
I am looking into parallelizing URL requests to one single webserver in Python for the first time.
I would like to use requests_futures for this task, as it seems that one can really split up the work onto several cores with the ProcessPoolExecutor.
The example code from the module documentation is:
from concurrent.futures import ThreadPoolExecutor
from requests_futures.sessions import FuturesSession
session = FuturesSession(executor=ThreadPoolExecutor(max_workers=2))
future_one = session.get('http://httpbin.org/get')
future_two = session.get('http://httpbin.org/get?foo=bar')
response_one = future_one.result()
print('response one status: {0}'.format(response_one.status_code))
print(response_one.content)
response_two = future_two.result()
print('response two status: {0}'.format(response_two.status_code))
print(response_two.content)
The above code works for me; however, I need some help with customizing it to my needs.
I want to query the same server, let's say, 50 times (e.g. 50 different httpbin.org/get?... requests). What would be a good way to split these up onto different futures, other than just defining future_one, future_two and so on?
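For illustration, a loop like the following is roughly what I have in mind for the ThreadPoolExecutor case; the query parameters are just placeholders, not my real requests:
from concurrent.futures import ThreadPoolExecutor
from requests_futures.sessions import FuturesSession

session = FuturesSession(executor=ThreadPoolExecutor(max_workers=2))

# fire off 50 requests against the same server without naming each future
futures = [session.get('http://httpbin.org/get?foo={0}'.format(i)) for i in range(50)]

# collect the responses as the futures complete
for future in futures:
    response = future.result()
    print(response.status_code, response.url)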
I am thinking about using different processes. According to the module documentation, it should be just a change in the first three lines of the above code:
from concurrent.futures import ProcessPoolExecutor
from requests_futures.sessions import FuturesSession
session = FuturesSession(executor=ProcessPoolExecutor(max_workers=2))
If I execute this I get the following error:
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
How do I get this running properly?
I have been trying to use subprocess32.Popen but this causes my system to crash (CPU 100%). So, I have the following code:
import subprocess32 as subprocess
for i in some_iterable:
    output = subprocess.Popen(['/path/to/sh/file/script.sh', i[0], i[1], i[2], i[3], i[4], i[5]],
                              shell=False, stdin=None, stdout=None, stderr=None, close_fds=True)
Before this, I had the following:
import subprocess32 as subprocess
for i in some_iterable:
    output = subprocess.check_output(['/path/to/sh/file/script.sh', i[0], i[1], i[2], i[3], i[4], i[5]])
.. and I had no problems with this - except that it was dead slow.
With Popen I see that this is fast - but my CPU goes to 100% in a couple of seconds and the system crashes - forcing a hard reboot.
I am wondering what I am doing that makes Popen crash the system?
This is on Linux, Python 2.7, if that helps at all.
Thanks.
The problem is that you are trying to start 2 million processes at once, which is blocking your system.
A solution would be to use a Pool to limit the maximum number of processes that can run at a time and to wait for each process to finish. For cases like this, where you're starting subprocesses and waiting for them (IO bound), a thread pool from the multiprocessing.dummy module will do:
import multiprocessing.dummy as mp
import subprocess32 as subprocess

def run_script(args):
    args = ['/path/to/sh/file/script.sh'] + args
    process = subprocess.Popen(args, close_fds=True)
    # wait for exit and return exit code
    # or use check_output() instead of Popen() if you need to process the output.
    return process.wait()

# use a pool of 10 to allow at most 10 processes to be alive at a time
threadpool = mp.Pool(10)

# pool.imap or pool.imap_unordered should be used to avoid creating a list
# of all 2M return values in memory
results = threadpool.imap_unordered(run_script, some_iterable)
for result in results:
    ...  # process result if needed
I've left out most of the arguments to Popen because you are using the default values anyway. The size of the pool should probably be in the range of your available CPU cores if your script is doing computational work; if it's doing mostly IO (network access, writing files, ...), then it can probably be larger.
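A minimal sketch of sizing the pool from the machine's core count, reusing the run_script helper and some_iterable from above:
import multiprocessing
import multiprocessing.dummy as mp

# CPU-bound scripts: roughly one worker per core
# IO-bound scripts: a multiple of the core count is usually fine
n_workers = multiprocessing.cpu_count()
threadpool = mp.Pool(n_workers)

for result in threadpool.imap_unordered(run_script, some_iterable):
    pass  # handle each exit code as it arrives

threadpool.close()
threadpool.join()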