PyTorch: W ParallelNative.cpp:206 - parallel processing

I'm trying to use a pre-trained model on my image dataset by following this tutorial:
https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html
But I always get this "error" (a warning) when I run my code, and the console locks up:
[W ParallelNative.cpp:206] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
Thank you in advance for your help,

I have the same problem.
Mac, Python 3.6 (also reproduces on 3.8), PyTorch 1.7.
It seems that with this warning the dataloaders don't (or can't) use parallel processing.
You can silence the warning (this will not fix the underlying problem) in two ways.
If you have access to your dataloaders, set num_workers=0 when creating a DataLoader.
Set the environment variable: export OMP_NUM_THREADS=1
Again, both workarounds kill parallelism and may slow down data loading (and therefore training). I look forward to an efficient solution or a patch in PyTorch 1.7.
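A minimal sketch of both workarounds in one place (the dataset path, transform, and batch size are placeholders, not taken from the tutorial):
import os
# Workaround 2: limit OpenMP to one thread *before* torch is imported
os.environ["OMP_NUM_THREADS"] = "1"

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

dataset = datasets.ImageFolder("path/to/images", transform=transforms.ToTensor())
# Workaround 1: num_workers=0 keeps data loading in the main process,
# so no worker process tries to reconfigure the thread pool
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)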

Related

Julia seems to be very slow

I am running the code shown in this question. I expected it to run faster the second and third time (on the first run it takes time to compile the code), but it seems to take the same amount of time as the first run. How can I make this code run faster?
Edit: I am running the code from a Linux terminal with: julia mycode.jl
I tried following the instructions in the answer by @Przemyslaw Szufel but got the following error:
julia> create_sysimage(["Plots"], sysimage_path="sys_plots.so", precompile_execution_file="precompile_plots.jl")
ERROR: MethodError: no method matching create_sysimage(::Array{String,1}; sysimage_path="sys_plots.so", precompile_execution_file="precompile_plots.jl")
Closest candidates are:
create_sysimage() at /home/cardio/.julia/packages/PackageCompiler/2yhCw/src/PackageCompiler.jl:462 got unsupported keyword arguments "sysimage_path", "precompile_execution_file"
create_sysimage(::Union{Array{Symbol,1}, Symbol}; sysimage_path, project, precompile_execution_file, precompile_statements_file, incremental, filter_stdlibs, replace_default, base_sysimage, isapp, julia_init_c_file, version, compat_level, soname, cpu_target, script) at /home/cardio/.julia/packages/PackageCompiler/2yhCw/src/PackageCompiler.jl:462
Stacktrace:
[1] top-level scope at REPL[25]:1
I am using Julia on Debian Stable Linux: Debian julia/1.5.3+dfsg-3
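Judging from the "closest candidates" listed in that error, this older PackageCompiler release expects package names as Symbols rather than strings; a sketch of the adjusted call (inferred from the error message, not taken from the answer below):
using PackageCompiler
create_sysimage([:Plots]; sysimage_path="sys_plots.so", precompile_execution_file="precompile_plots.jl")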
In Julia, packages are compiled the first time they are used within a Julia session. Hence starting a new Julia process means that Plots.jl gets compiled again each time, and since it is quite a big package, this takes a significant amount of time.
To circumvent this, use PackageCompiler to compile Plots.jl into a static system image that Julia can reuse later.
The basic steps include:
using PackageCompiler
create_sysimage(["Plots"], sysimage_path="sys_plots.so", precompile_execution_file="precompile_plots.jl")
After this is done you will need to run your code as:
julia --sysimage sys_plots.so mycode.jl
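The precompile_execution_file named above is simply a script that exercises the calls you want baked into the image; a minimal sketch of what precompile_plots.jl might contain (its contents are an assumption, adapt them to your own plotting code):
# precompile_plots.jl - run during sysimage creation so these code paths get compiled into sys_plots.so
using Plots
p = plot(rand(10), rand(10))
savefig(p, tempname() * ".png")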
Similarly, you could have added MultivariateStats and RDatasets to the generated sysimage, but I do not think they cause any significant delay.
Note that if the subsequent runs are part of your development process (rather than your production system) and you are, e.g., developing a Julia module, then you could instead consider using Revise.jl during development rather than precompiling a sysimage. Having a sysimage means that you will need to rebuild it each time you update your Julia packages, so I would consider this approach for production rather than development (it depends on your exact scenario).
I had this problem and almost went back to Python, but now I run scripts in the REPL with include. It is much faster this way.
Note: the first run will be slow, but subsequent runs in the same REPL session will be fast even if the script has been edited.
Fedora 36, Julia 1.8.1
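For example, with the script name from the question above:
julia> include("mycode.jl")   # slow the first time, fast on every later call in the same session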

Prevent overwriting modules in Julia parallelization

I've written a Julia module with various functions which I call to analyze data. Several of these functions depend on packages, which are loaded at the start of the file NeuroTools.jl:
module NeuroTools
using MAT, PyPlot, PyCall;
function getHists(channels::Array{Int8,2}...
Many of the functions I have are useful to run in parallel, so I wrote a driver script to map functions to different workers using remotecall/fetch. To load the functions on each worker, I launch Julia with the -L option so that my module is loaded on every worker.
julia -p 16 -L NeuroTools.jl parallelize.jl
To bring the loaded functions into scope, the "parallelize.jl" script has the line
@everywhere using NeuroTools
My parallel function works and executes properly, but each worker process spits out a bunch of warnings about the modules being overwritten.
WARNING: replacing module MAT
WARNING: Method definition read(Union{HDF5.HDF5Dataset, HDF5.HDF5Datatype, HDF5.HDF5Group}, Type{Bool}) in module MAT_HDF5...
(continues for many lines)
Is there a way to load the module differently or change the scope to prevent all these warnings? The documentation does not seem entirely clear on this issue.
Coincidentally, I was looking for the same thing this morning. You can silence output with
(rd, wr) = redirect_stdout()
so on each worker you would need to call
remotecall_fetch(worker_id, redirect_stdout)
If you just want output turned off completely, that is enough. If you want to be able to turn it back on again, you could do:
out = STDOUT
(a,b) = redirect_stdout()
#then to turn it back on, do:
redirect_stdout(out)
This is fixed in more recent releases, and @everywhere using ... is right if you really need the module in scope on all workers. This GitHub issue discusses the problem and links to some of the other relevant discussions.
If you are still using an older version of Julia where this was the case, just write using NeuroTools in NeuroTools.jl after defining the module, instead of executing @everywhere using NeuroTools (see the sketch after the quoted documentation below).
The Parallel Computing section of the Julia documentation for version 0.5 says,
using DummyModule causes the module to be loaded on all processes; however, the module is brought into scope only on the one executing the statement.
Executing @everywhere using NeuroTools used to tell every process to load the module on all processes, and the result was a pile of "replacing module" warnings.
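A minimal sketch of the older-version workaround described above (the module body is abbreviated; the trailing using line is the point):
# NeuroTools.jl
module NeuroTools
using MAT, PyPlot, PyCall
# ... function definitions, e.g. getHists ...
end
using NeuroTools   # brought into scope here, so the driver script no longer needs @everywhere using NeuroTools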

For the TensorFlow Inception v3 examples, why does the C++ classifier run much slower than the Python one?

I'm running the example/how-to from the TensorFlow documentation at https://www.tensorflow.org/versions/r0.7/tutorials/image_recognition/index.html
For some reason, the Python classifier (./tensorflow/models/image/imagenet/classify_image.py) runs much faster than the C++ classifier (./tensorflow/examples/label_image/main.cc). I expected the opposite. Can anybody tell me why this is the case?
Environment: Windows 8.1 Pro (Core i7) + Hyper-V + Docker (without GPU).
label_image (the C++ example) takes 30 seconds.
classify_image (the Python example) takes 3 seconds.
I ran each once before measuring to make sure the data was already downloaded.
I see a lot of warnings at label_image runtime, for example:
W tensorflow/core/kernels/batch_norm_op.cc:36] Op is deprecated. It will cease to work in GraphDef version 9. Use tf.nn.batch_normalization().

Octave Parallel Computing Toolbox

Can anyone who has successfully used the Octave Forge package "parallel" (latest version 2.2.0) share some of their experience on how to use it?
For a start, I'd like to execute a for loop in parallel on a single computer, something similar to the following code in Matlab
matlabpool open 4;
for i = 1:n_pts
% Execute something in parallel
end
matlabpool close;
I just installed the package but I cannot find any useful documentation on how to actually use it.
Thanks!
To my knowledge there is no "parallel for" in Octave yet.
But if you need something calculated for each i, you can use the "simple" example I just created in the Octave wiki, replacing x_vector with 1:n_pts and the function fun with your own; a sketch of that pattern follows below.
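A minimal sketch of that pattern, assuming the parallel package's pararrayfun (the squaring function is just a placeholder for your per-point computation):
pkg load parallel
n_pts = 100;                        % however many points you have
fun = @(i) i^2;                     % replace with your own computation
results = pararrayfun(nproc, fun, 1:n_pts);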

Wisdom in FFTW doesn't import/export

I am using FFTW for FFTs. It's all working well, but the optimisation takes a long time with the FFTW_PATIENT flag. However, according to the FFTW docs, I can improve on this by reusing wisdom between runs, which I can import from and export to a file. (I am using the single-precision FFTW routines, hence the fftwf_ prefix below instead of fftw_.)
So, at the start of my main(), I have:
char wisdom_file[] = "optimise.fft";
fftwf_import_wisdom_from_filename(wisdom_file);
and at the end, I have:
fftwf_export_wisdom_to_filename(wisdom_file);
(I've also got error-checking to check the return is non-zero, omitted for simplicity above, so I know the files are reading and writing correctly)
After one run I get a file optimise.fft with what looks like ASCII wisdom. However, subsequent runs do not get any faster, and if I create my plans with the FFTW_WISDOM_ONLY flag, I get a null plan, showing that it doesn't see any wisdom there.
I am using three different FFTs (two real-to-complex and one inverse complex-to-real), so I have also tried importing/exporting around each FFT, and exporting to separate files, but that doesn't help.
I am using FFTW 3.3.3. I can see that FFTW 2 seemed to need more setting up to reuse wisdom, but the above seems sufficient now. What am I doing wrong?
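For reference, this is the round-trip I would expect to work, written as a self-contained sketch (a single 1-D real-to-complex transform; the size and layout are assumptions, not taken from the question):
#include <stdio.h>
#include <fftw3.h>

#define N 4096

int main(void)
{
    const char wisdom_file[] = "optimise.fft";
    float *in = fftwf_alloc_real(N);
    fftwf_complex *out = fftwf_alloc_complex(N / 2 + 1);

    if (!fftwf_import_wisdom_from_filename(wisdom_file))
        fprintf(stderr, "no wisdom imported (first run?)\n");

    /* Try to reuse stored wisdom; FFTW_WISDOM_ONLY succeeds only if wisdom
       for exactly this transform (size, type, alignment, flags) is present. */
    fftwf_plan p = fftwf_plan_dft_r2c_1d(N, in, out, FFTW_WISDOM_ONLY | FFTW_PATIENT);
    if (p == NULL)
        p = fftwf_plan_dft_r2c_1d(N, in, out, FFTW_PATIENT);

    /* ... fill `in`, call fftwf_execute(p), use `out` ... */

    if (!fftwf_export_wisdom_to_filename(wisdom_file))
        fprintf(stderr, "failed to export wisdom\n");

    fftwf_destroy_plan(p);
    fftwf_free(in);
    fftwf_free(out);
    return 0;
}
One thing this structure makes easy to check: if the FFTW_WISDOM_ONLY branch always fails, the stored wisdom does not match the plans actually being created (a different size, precision, alignment, or flag set).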

Resources