Alternative use patterns for python multiprocessing avoiding proliferation of global state?

This (enormously simplified example) works fine (Python 2.6.6, Debian Squeeze):
from multiprocessing import Pool
import numpy as np

src = None

def process(row):
    return np.sum(src[row])

def main():
    global src
    src = np.ones((100, 100))
    pool = Pool(processes=16)
    rows = pool.map(process, range(100))
    print rows

if __name__ == "__main__":
    main()
However, after years of being taught that global state is bad, all my instincts are telling me I would really rather be writing something closer to:
from multiprocessing import Pool
import numpy as np

def main():
    src = np.ones((100, 100))

    def process(row):
        return np.sum(src[row])

    pool = Pool(processes=16)
    rows = pool.map(process, range(100))
    print rows

if __name__ == "__main__":
    main()
but of course that doesn't work (it hangs, unable to pickle something).
The example here is trivial, but by the time you add multiple "process" functions, each of which depends on multiple additional inputs... well, it all becomes a bit reminiscent of something written in BASIC 30 years ago. Trying to use classes to at least aggregate the state with the appropriate functions seems an obvious solution, but it doesn't seem to be that easy in practice.
Is there some recommended pattern or style for using multiprocessing.pool which will avoid the proliferation of global state to support each function I want to parallel map over?
How do experienced "multiprocessing pros" deal with this?
Update: Note that I'm actually interested in processing much bigger arrays, so variations on the above which pickle src each call/iteration aren't nearly as good as ones which fork it into the pool's worker processes.

You could always pass a callable object like this; then the object can contain the shared state:
from multiprocessing import Pool
import numpy as np

class RowProcessor(object):
    def __init__(self, src):
        self.__src = src

    def __call__(self, row):
        return np.sum(self.__src[row])

def main():
    src = np.ones((100, 100))
    p = RowProcessor(src)
    pool = Pool(processes=16)
    rows = pool.map(p, range(100))
    print rows

if __name__ == "__main__":
    main()
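Given the update about large arrays, one variant worth mentioning is to hand the array to each worker via the pool's initializer, so it is inherited (or shipped) once per worker rather than carried along with every task. This is only a minimal sketch under the same Python 2 / fork assumptions as the question; init_worker and _worker_src are illustrative names, not anything from the thread:

from multiprocessing import Pool
import numpy as np

_worker_src = None  # set once per worker by init_worker, not per map() item

def init_worker(src):
    # Runs once in each worker process when the pool starts up.
    global _worker_src
    _worker_src = src

def process(row):
    return np.sum(_worker_src[row])

def main():
    src = np.ones((100, 100))
    pool = Pool(processes=16, initializer=init_worker, initargs=(src,))
    rows = pool.map(process, range(100))
    print rows  # Python 2, as in the question

if __name__ == "__main__":
    main()

With a fork start method the array is inherited by the workers; with spawn it is pickled once per worker, which is still far cheaper than once per map item.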

Related

On-the-fly tokenization with datasets, tokenizers, and torch Datasets and Dataloaders

I have a question regarding "on-the-fly" tokenization. This question was prompted by reading the "How to train a new language model from scratch using Transformers and Tokenizers" tutorial here. Towards the end there is this sentence: "If your dataset is very large, you can opt to load and tokenize examples on the fly, rather than as a preprocessing step". I've tried coming up with a solution that would combine both datasets and tokenizers, but did not manage to find a good pattern.
I guess the solution would entail wrapping a dataset into a Pytorch dataset.
As a concrete example from the docs
import torch

class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        # instead of doing this beforehand, I'd like to do tokenization on the fly
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

train_dataset = SquadDataset(train_encodings)
How would one implement this with "on-the-fly" tokenization exploiting the vectorized capabilities of tokenizers?
UPDATE Feb 2021
As of v1.3.0, datasets supports lazy evaluation of functions via the set_transform method, so you can apply on-the-fly tokenization directly, as shown here.
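A minimal sketch of what that can look like; the dataset name, split, and "text" column below are assumptions for illustration only:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("imdb", split="train")  # assumed dataset with a "text" column

def tokenize(batch):
    # Applied lazily, batch by batch, whenever __getitem__ is called.
    return tokenizer(batch["text"], truncation=True, padding=True, return_tensors="pt")

dataset.set_transform(tokenize)

print(dataset[:8]["input_ids"].shape)  # tokenization happens here, on the fly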
OLD ANSWER
In the end I settled on this solution. I do not like that the batch_size is now controlled at the dataset level, but it does its job.
This way we exploit two nice things:
fast indexing into the HuggingFace dataset
the vectorization capabilities of the HuggingFace tokenizer
from typing import Dict, List

import torch
from torch.utils.data import BatchSampler, Dataset, SequentialSampler

class CustomPytorchDataset(Dataset):
    """
    This class wraps the HuggingFace dataset and allows for
    batch indexing into the dataset. This allows exploiting
    the capabilities of the tokenizer to work on batches.

    NOTE: we now control batch_size at the Dataset level, not
    in the DataLoader; therefore the DataLoader should always be
    used with `batch_size=1`.
    """
    def __init__(self, batch_size: int):
        self.batch_size = batch_size
        self.dataset = train_ds          # HuggingFace dataset
        self.tokenizer = bert_tokenizer  # HuggingFace tokenizer

    def __getitem__(self, batch_idx: List[int]):
        instance = self.dataset[batch_idx]
        # tokenize on-the-fly
        tokenized_instance = self.tokenizer(
            instance[text_col],
            truncation=True,
            padding=True
        )
        return tokenized_instance

    def __len__(self):
        return len(self.dataset)

    def sampler(self):
        # shuffling can be controlled by the sampler,
        # without touching the dataset
        return BatchSampler(
            SequentialSampler(self),
            batch_size=self.batch_size,
            drop_last=True
        )

    @staticmethod
    def collate_fn(batches: List[Dict[str, int]]):
        return {
            k: torch.tensor(v, dtype=torch.int64)
            for k, v in batches[0].items()
        }
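For what it's worth, usage would then look something like the sketch below, assuming train_ds, bert_tokenizer and text_col are defined as the answer describes:

from torch.utils.data import DataLoader

dataset = CustomPytorchDataset(batch_size=32)

loader = DataLoader(
    dataset,
    batch_size=1,                               # real batching is done by the sampler
    sampler=dataset.sampler(),
    collate_fn=CustomPytorchDataset.collate_fn,
)

for batch in loader:
    print(batch["input_ids"].shape)             # tokenized on the fly, one batch at a time
    break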

How to find all local minimums of a function efficiently

This question is related to global optimization, but it is simpler. The task is to find all local minima of a function. This is sometimes useful; for example, in physics we might want to find metastable states besides the true ground state in phase space. I have a naive implementation which has been tested on the scalar function x*sin(x) + x*cos(2*x) by randomly searching points in the interval. But clearly this is not efficient. The code and output are attached if you are interested.
#!/usr/bin/env python
from scipy import *
from numpy import *
from pylab import *
from numpy import random
"""
Search all of the local minima using random search when the
functional form of the target function is known.
"""

def function(x):
    return x*sin(x)+x*cos(2*x)
    # return x**4-3*x**3+2

def derivative(x):
    return sin(x)+x*cos(x)+cos(2*x)-2*x*sin(2*x)
    # return 4.*x**3-9.*x**2

def ploting(xr,yr,mls):
    plot(xr,yr)
    grid()
    for xm in mls:
        axvline(x=xm,c='r')
    savefig("plotf.png")
    show()

def findlocmin(x,Nit,step_def=0.1,err=0.0001,gamma=0.01):
    """
    We use the gradient descent method to find a local minimum,
    using x as the starting point.
    """
    for i in range(Nit):
        slope=derivative(x)
        step=min(step_def,abs(slope)*gamma)
        x=x-step*slope/abs(slope)
        # print step,x
        if(abs(slope)<err):
            print "Found local minimum using "+str(i)+' iterations'
            break
        if i==Nit-1:
            raise Exception("local min is not found using Nit=",str(Nit),'iterations')
    return x

if __name__=="__main__":
    xleft=-9;xright=9
    xs=linspace(xleft,xright,100)
    ys=array([function(x) for x in xs])
    minls=[]
    Nrand=100;it=0
    Nit=10000
    while it<Nrand:
        xint=random.uniform(xleft,xright)
        xlocm=findlocmin(xint,Nit)
        print xlocm
        minls.append(xlocm)
        it+=1
    # print minls
    ploting(xs,ys,minls)
I'd like to know if there is a better solution to this.

cython code continues after prange returns value

Suppose I have the following function:
cimport cython
cimport numpy as np
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef bint test(np.int_t [:] values):
    cdef Py_ssize_t n_values = len(values)
    cdef int i
    for i in prange(n_values, nogil=True):
        if i == 0:
            return 0
    print 'test'
I run it like so:
In [1]: import algos
In [2]: import numpy as np
In [3]: algos.test(np.array([1,2,3,1,4,5]))
test
Out[3]: False
Why is the function printing when it should have just exited without printing? Is there a way to have the function exit when it reaches the return?
Thank you.
The documentation is clear that this is a bit of a minefield, since there's no guarantee which iteration finishes first. I think the added complication that they don't make clear is that if the number of iterations is small enough then (provided one thread is done) you can also end up continuing on to the code after the prange, which is what you see.
What seems to work for me is to use the else clause of a loop, which only gets executed if it hasn't finished early:
for i in prange(n_values, nogil=True):
    # stuff ...
else:
    with gil:
        print "test"
A quick look at the C code suggests that this is putting appropriate checks in place and it should be reliable.
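Putting that together with the function from the question, an untested sketch (with the imports the snippet implicitly assumes) might look like:

cimport cython
cimport numpy as np
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef bint test(np.int_t [:] values):
    cdef Py_ssize_t n_values = len(values)
    cdef int i
    for i in prange(n_values, nogil=True):
        if i == 0:
            return 0
    else:
        # only reached if no iteration returned early
        with gil:
            print "test"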

value stored in multiprocessing Value is not the same as the original

I am trying to store a bytes object inside a multiprocessing Value, but I found that the value actually stored and the value I wanted to store are different. As an example, the following code evaluates to False.
import ctypes
import dill
import multiprocessing as mp
x = [{str(i):i} for i in range(10)]
y = dill.dumps(x)
serialized_workbook = mp.Value(ctypes.c_char_p, dill.dumps(None))
serialized_workbook.value = y
serialized_workbook.value == y
Why does this happen and how can I prevent it from happening? The reason I want to do this is so that I can later load it again.
dill.loads(serialized_workbook.value)
The above fails while the following piece of code does not.
dill.loads(y)

pythonic way to do something N times without an index variable? [duplicate]

This question already has answers here: Is it possible to implement a Python for range loop without an iterator variable?
I have some code like:
for i in range(N):
    do_something()
I want to do something N times. The code inside the loop doesn't depend on the value of i.
Is it possible to do this simple task without creating a useless index variable, or in an otherwise more elegant way? How?
A slightly faster approach than looping on xrange(N) is:
import itertools
for _ in itertools.repeat(None, N):
    do_something()
Use the _ variable, like so:
# A long way to do integer exponentiation
num = 2
power = 3
product = 1
for _ in range(power):
    product *= num
print(product)
I just use for _ in range(n); it's straight to the point. In Python 2 it will generate the entire list for huge numbers, but if you're using Python 3 it's not a problem.
Since functions are first-class citizens, you can write a small wrapper (building on Alex's answer):
import itertools

def repeat(f, N):
    for _ in itertools.repeat(None, N):
        f()
Then you can pass the function as an argument.
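For example, with a dummy do_something defined just for illustration:

def do_something():
    print("doing something")

repeat(do_something, 3)  # calls do_something() three times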
The _ is the same thing as x. However, it's a Python idiom used to indicate an identifier that you don't intend to use. In Python these identifiers don't take memory or allocate space like variables do in some other languages. It's easy to forget that. They're just names that point to objects, in this case an integer on each iteration.
I found the various answers really elegant (especially Alex Martelli's) but I wanted to quantify performance first hand, so I cooked up the following script:
from itertools import repeat

N = 10000000

def payload(a):
    pass

def standard(N):
    for x in range(N):
        payload(None)

def underscore(N):
    for _ in range(N):
        payload(None)

def loopiter(N):
    for _ in repeat(None, N):
        payload(None)

def loopiter2(N):
    for _ in map(payload, repeat(None, N)):
        pass

if __name__ == '__main__':
    import timeit
    print("standard: ", timeit.timeit("standard({})".format(N),
          setup="from __main__ import standard", number=1))
    print("underscore: ", timeit.timeit("underscore({})".format(N),
          setup="from __main__ import underscore", number=1))
    print("loopiter: ", timeit.timeit("loopiter({})".format(N),
          setup="from __main__ import loopiter", number=1))
    print("loopiter2: ", timeit.timeit("loopiter2({})".format(N),
          setup="from __main__ import loopiter2", number=1))
I also came up with an alternative solution that builds on Martelli's and uses map() to call the payload function. OK, I cheated a bit in that I took the liberty of making the payload accept a parameter that gets discarded: I don't know if there is a way around this. Nevertheless, here are the results:
standard: 0.8398549720004667
underscore: 0.8413165839992871
loopiter: 0.7110594899968419
loopiter2: 0.5891903560004721
So using map yields an improvement of approximately 30% over the standard for loop, and an extra 19% over Martelli's.
Assume that you've defined do_something as a function, and you'd like to perform it N times.
Maybe you can try the following:
todos = [do_something] * N
for doit in todos:
    doit()
What about a simple while loop?
while times > 0:
    do_something()
    times -= 1
You already have the variable; why not use it?
