Pytorch is not working with DistributedDataParallel for multi gpu training - parallel-processing

I am trying to train my model on multiple GPUS. I used the libraries and a added a code for it
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group
Initialization
def ddp_setup(rank: int, world_size: int):
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12355"
os.environ["TORCH_DISTRIBUTED_DEBUG"]="DETAIL"
init_process_group(backend="gloo", rank=0, world_size=1)
my model
model = CMGCNnet(config,
que_vocabulary=glovevocabulary,
glove=glove,
device=device)
model = model.to(0)
if -1 not in args.gpu_ids and len(args.gpu_ids) > 1:
model = DDP(model, device_ids=[0,1])
it throws following error:
config_yml : model/config_fvqa_gruc.yml
cpu_workers : 0
save_dirpath : exp_test_gruc
overfit : False
validate : True
gpu_ids : [0, 1]
dataset : fvqa
Loading FVQATrainDataset…
True
done splitting
Loading FVQATestDataset…
Loading glove…
Building Model…
Traceback (most recent call last):
File “trainfvqa_gruc.py”, line 512, in
train()
File “trainfvqa_gruc.py”, line 145, in train
ddp_setup(0,1)
File “trainfvqa_gruc.py”, line 42, in ddp_setup
init_process_group(backend=“gloo”, rank=0, world_size=1)
File “/home/seecs/miniconda/envs/mucko-edit/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py”, line 360, in init_process_group
timeout=timeout)
RuntimeError: [enforce fail at /opt/conda/conda-bld/pytorch_1544202130060/work/third_party/gloo/gloo/transport/tcp/device.cc:128] rp != nullptr. Unable to find address for: 127.0.0.1localhost.
localdomainlocalhost
I tried printing the issue with os.environ["TORCH_DISTRIBUTED_DEBUG"]="DETAIL"
it outputs:
Loading FVQATrainDataset...
True
done splitting
Loading FVQATestDataset...
Loading glove...
Building Model...
Segmentation fault
with NCCL background it starts the training but get stuck and doesn’t go further than this :slight_smile:
Training for epoch 0:
0%| | 0/2039 [00:00<?, ?it/s]
I found this solution but where to add these lines?
GLOO_SOCKET_IFNAME* , for example export GLOO_SOCKET_IFNAME=eth0`
mentioned in
https://discuss.pytorch.org/t/runtime-error-using-distributed-with-gloo/16579/3
Can someone help me with this issue?
to seek help. I am hoping to get and answer

Related

Multiclass confusion matrix problem ValueError: multilabel-indicator is not supported

I have successfully created a pytorch NN with 10 inputs and two output classes (B & S). So B is either 0 or 1, and S is either 0 or 1. So Y_test is B(0),B(1),S(0),S(1). Y_Pred will output B(0...1) and S(0...1). The net trains without error...
Now I want to create a confusion matrix and I am confused.
This is my code:
cm = confusion_matrix(y_test, y_pred)
print (cm)
It generates this error message:
"Traceback (most recent call last):
File "C:/Users/user/PycharmProjects/3_indexes_NN/3_indexes_NN.py", line 386, in "
[line 386 is cm = confusion_matrix(y_test, y_pred)]
cm = confusion_matrix(y_test, y_pred)
File "C:\Users\user\anaconda3\envs\PIMA\lib\site-packages\sklearn\metrics_classification.py", line 309, in confusion_matrix
raise ValueError("%s is not supported" % y_type)
ValueError: multilabel-indicator is not supported
I am completely lost. Can anyone help me understand where I've gone wrong?
Thank you in advance for throwing me a life-line!!
Can both S and B be 1 simultaneously? In this case, this is indeed a multilabel task, making confusion matrix an unsuitable metric.
Otherwise, you may work it around using something like y_pred.argmax(axis=1) for both.

Pyautogui and pyscreeze crash with windll.user32.ReleaseDC failed

I'm trying to compare certain pixel values in my pyautogui script, but it crashes with following error message after either multiple successful runs, or sometimes just straight on the first call:
Traceback (most recent call last):
File "F:\Koodit\Python\HeroWars NNet\Assets\autodataGet.py", line 219, in <module>
battle = observeBattle()
File "F:\Koodit\Python\HeroWars NNet\Assets\autodataGet.py", line 180, in observeBattle
statii = getHeroBattlePixels()
File "F:\Koodit\Python\HeroWars NNet\Assets\autodataGet.py", line 32, in getHeroBattlePixels
colormatch = pyautogui.pixelMatchesColor(location[0], location[1], alive, tolerance=5)
File "E:\Program Files\Python\lib\site-packages\pyscreeze\__init__.py", line 557, in pixelMatchesColor
pix = pixel(x, y)
File "E:\Program Files\Python\lib\site-packages\pyscreeze\__init__.py", line 582, in pixel
return (r, g, b)
File "E:\Program Files\Python\lib\contextlib.py", line 120, in __exit__
next(self.gen)
File "E:\Program Files\Python\lib\site-packages\pyscreeze\__init__.py", line 111, in __win32_openDC
raise WindowsError("windll.user32.ReleaseDC failed : return 0")
OSError: windll.user32.ReleaseDC failed : return 0
My code (this is called multiple times, sometimes it crashes on first run, sometimes it runs nicely for around 100 calls before failing, also, my screen is 4K, so the resolutions get big):
def getSomePixelStatuses():
someLocations= [
[1200, 990],
[1300, 990],
[1400, 990],
[1500, 990],
[1602, 990],
[1768, 990],
[1868, 990],
[1968, 990],
[2068, 990],
[2169, 990]
]
status = []
someValue= (92, 13, 12)
for location in someLocations:
colormatch = pyautogui.pixelMatchesColor(location[0], location[1], someValue, tolerance=5)
status.append(colormatch)
return status
I have no idea how to mitigate this problem. It would seem that pyautogui uses pyscreeze to read pixel values on screen, and most probable candidate for the place where error occurs is the pyscreeze pixel function:
def pixel(x, y):
"""
TODO
"""
if sys.platform == 'win32':
# On Windows, calling GetDC() and GetPixel() is twice as fast as using our screenshot() function.
with __win32_openDC(0) as hdc: # handle will be released automatically
color = windll.gdi32.GetPixel(hdc, x, y)
if color < 0:
raise WindowsError("windll.gdi32.GetPixel failed : return {}".format(color))
# color is in the format 0xbbggrr https://msdn.microsoft.com/en-us/library/windows/desktop/dd183449(v=vs.85).aspx
bbggrr = "{:0>6x}".format(color) # bbggrr => 'bbggrr' (hex)
b, g, r = (int(bbggrr[i:i+2], 16) for i in range(0, 6, 2))
return (r, g, b)
else:
# Need to select only the first three values of the color in
# case the returned pixel has an alpha channel
return RGB(*(screenshot().getpixel((x, y))[:3]))
I installed these libraries just yesterday, and I'm running python 3.8 on windows 10, and pyscreeze is version 0.1.25 so in theory everything should be up to date, but somehow something ends up crashing. Is there a way to mitigate this, either modifying my code, or even the library itself, or is my environment not suitable for this operation?
Well I know it's not particularly helpful; but for me, this error was fixed simply by running my code on 3.7 instead of 3.8. There shouldn't be any changes you have to make to your code, however (unless you were using walrus!)
On Windows, this can be done with the -3.7 command line flag, as long as 3.7 is installed
PyScreeze and PyAutoGUI maintainer here. This is an issue that has been fixed in PyScreeze 0.1.28, so you just need to update it by running pip install -U pyscreeze.
For more context, here's the GitHub issue where it was reported: https://github.com/asweigart/pyscreeze/pull/73
It's a bug. You were on the right track, as the problem is indeed in this line of the pixel() function:
with __win32_openDC(0) as hdc
That function uses cyptes.windll which doesn't seem to do well with the negative values sometimes returned from windll.user32.GetDC(), which subsequently creates an exception when windll.user32.ReleaseDC() is called.
The folks at pillow helped track this down and propose a fix.
issue filed at pyautogui
issue filed at pillow which led to the solution
pending PR at pyscreeze to address
I can use pixel function on Python 3.8 like this:
try:
a = pixel(100,100)
> except:
> a = pixel(100,100)
I don't have any clue why this works, but it works.
I had this error too and i fixed it. Just use try and except.
While true:
try:
x,y = pyautogui.position()
print(pyautogui.pixel(x,y))
except:
print("Cannot get pixel for the moment")
Given that you might be taking pixels multiple times, or you can do so, try and except works wonders to solve any pyscreeze for pyautogui issue. Honestly i dont know whats up with pyscreeze, but this works for me. Cheers

FileNotFoundError: No such file: 'someones_epi.nii.gzip'

I am trying to load an MRI, I keep getting the following error:
Traceback (most recent call last):
File "F:/Study/Projects/BTSaG/Programs/t3.py", line 2, in <module> epi_img = nib.load('someones_epi.nii.gzip')
File "C:\Users\AnkitaShinde\AppData\Local\Programs\Python\Python35-32\lib\site-packages\nibabel\loadsave.py", line 38, in load raise FileNotFoundError("No such file: '%s'" % filename)
FileNotFoundError: No such file: 'someones_epi.nii.gzip'
The code is used is as follows:
import nibabel as nib
epi_img = nib.load('someones_epi.nii.gzip')
epi_img_data = epi_img.get_data()
epi_img_data.shape(53, 61, 33)
import matplotlib.pyplot as plt
def show_slices(slices):
""" Function to display row of image slices """
fig, axes = plt.subplots(1, len(slices))
for i, slice in enumerate(slices):
axes[i].imshow(slice.T, cmap="gray", origin="lower")
slice_0 = epi_img_data[26, :, :]
slice_1 = epi_img_data[:, 30, :]
slice_2 = epi_img_data[:, :, 16]
show_slices([slice_0, slice_1, slice_2])
plt.suptitle("Center slices for EPI image")
I have also updated the loadsave.py file in nibabel but it didn't work. Please help.
Edit:
The earlier error was resolved. Now another error has been encountered.
Traceback (most recent call last):File "F:\Study\Projects\BTSaG\Programs\t3.py", line 2, in <module> epi_img = nib.load('someones_epi.nii.gzip')
File "C:\Users\AnkitaShinde\AppData\Local\Programs\Python\Python35-32\lib\site-packages\nibabel\loadsave.py", line 47, in load filename)
nibabel.filebasedimages.ImageFileError: Cannot work out file type of "someones_epi.nii.gzip"
This is an old question, however I may have the solution for it.
I just figured out that nibabel.save() does not allow me to have dot . or dash - in the folder names. These can exist in filenames however. In your case, the current path is:
C:\Users\AnkitaShinde\AppData\Local\Programs\Python\Python35-32\Lib\site-packages\nibabel\someones_epi.nii.gzip
I would change it to:
C:\Users\AnkitaShinde\AppData\Local\Programs\Python\Python35_32\Lib\site_packages\nibabel\someones_epi.nii.gzip
This is just to give an example. Of course, I don't mean that you actually change the names of these package folders as it might cause other errors.
The actual solution would be to move the file someones_epi.nii.gzip to the user structure, something like:
C:\Users\AnkitaShinde\Desktop\nibabel\someones_epi.nii.gzip

Python Time based rotating file handler for logging

While using time based rotating file handler.Getting error
os.rename('logthred.log', dfn)
WindowsError: [Error 32] The process cannot access the file because it
is being used by another process
config :
[loggers]
keys=root
[logger_root]
level=INFO
handlers=timedRotatingFileHandler
[formatters]
keys=timedRotatingFormatter
[formatter_timedRotatingFormatter]
format = %(asctime)s %(levelname)s %(name)s.%(functionname)s:%(lineno)d %
(output)s
datefmt=%y-%m-%d %H:%M:%S
[handlers]
keys=timedRotatingFileHandler
[handler_timedRotatingFileHandler]
class=handlers.TimedRotatingFileHandler
level=INFO
formatter=timedRotatingFormatter
args=('D:\\log.out', 'M', 2, 0, None, False, False)
Want to achieve time based rotating file handler and multiple process can write same log file.In python,I didn't find any thing which can help to resolve this issue.
I have read discussion on this issue (python issues).
Any suggestion which can resolve this issue.
Found the solution : Problem in Python 2.7 whenever we create child Process then file handles of parent process also inherited by child process so this is causing error. We can block this inheritance using this code.
import sys
from ctypes import windll
import msvcrt
import __builtin__
if sys.platform == 'win32':
__builtin__open = __builtin__.open
def __open_inheritance_hack(*args, **kwargs):
result = __builtin__open(*args, **kwargs)
handle = msvcrt.get_osfhandle(result.fileno())
if filename in args: # which filename handle you don't want to give to child process.
windll.kernel32.SetHandleInformation(handle, 1, 0)
return result
__builtin__.open = __open_inheritance_hack

Python ode first order, how to solve this using Sympy

When I try to solve this first ode by using Sympy as it shows below:
import sympy
y = sympy.Function('y')
t = sympy.Symbol('t')
ode = sympy.Eq(y(t).diff(t),(1/y(t))*sympy.sin(t))
sol = sympy.dsolve(ode,y(t))
csol=sol.subs([(t,0),(y(0),-4)]) # the I.C. is y(0) = 1
ode_sol= sol.subs([(csol.rhs,csol.lhs)])
print(sympy.pprint(ode_sol))
It gives me this error:
Traceback (most recent call last):
File "C:/Users/Mohammed Alotaibi/AppData/Local/Programs/Python/Python35/ODE2.py", line 26, in <module>
csol=sol.subs([(t,0),(y(0),-4)]) # the I.C. is y(0) = 1
AttributeError: 'list' object has no attribute 'subs'
Your problem is that this ODE does not have a unique solution. Thus it returns a list of solution, which you can find out from the error message and by printing sol.
Do the evaluation in a loop,
for psol in sol:
csol = psol.subs([(t,0),(y(0),-4)]);
ode_sol = psol.subs([(csol.rhs,csol.lhs)]);
print(sympy.pprint(ode_sol))
to find the next error, that substituting does not solve for the constant. What works is to define C1=sympy.Symbol("C1") and using
ode_sol= psol.subs([(C1, sympy.solve(csol)[0])]);
but this still feels hacky. Or better to avoid error messages for the unsolvability of the second case:
C1=sympy.Symbol("C1");
for psol in sol:
csol = psol.subs([(t,0),(y(0),-4)]);
for cc1 in sympy.solve(csol):
ode_sol= psol.subs([(C1, cc1)]);
print(sympy.pprint(ode_sol))

Resources