TensorFlow build from source not faster for retraining? - performance

I've been running TensorFlow on my lovely MBP (early 2015), CPU only.
I decided to build a TensorFlow version with Bazel to speed things up, with SSE4.1, SSE4.2, AVX, AVX2 and FMA:
bazel build --copt=-march=native //tensorflow/tools/pip_package:build_pip_package
But retraining the Inception v3 model with the new install isn't any faster; it takes exactly the same amount of time.
That is strange, because doing inference with a trained Inception model I get a 12% speed increase, and training the MNIST example is 30% faster.
So is it possible that we get no speed benefit for retraining?
I also did a Bazel build for the retrainer, as explained here, with the same result.
My ./configure:
Please specify the location of python. [Default is /Users/Gert/Envs/t4/bin/python]: Users/Gert/Envs/t4/bin/python3
Invalid python path. Users/Gert/Envs/t4/bin/python3 cannot be found
Please specify the location of python. [Default is /Users/Gert/Envs/t4/bin/python]: ls
Invalid python path. ls cannot be found
Please specify the location of python. [Default is /Users/Gert/Envs/t4/bin/python]: lslss
Invalid python path. lslss cannot be found
Please specify the location of python. [Default is /Users/Gert/Envs/t4/bin/python]: /rt/Envs/t4/bin/python3^C
(t4) Gerts-MacBook-Pro:tensorflow root#
(t4) Gerts-MacBook-Pro:tensorflow root# ./configure
Please specify the location of python. [Default is /Users/Gert/Envs/t4/bin/python]: /Users/Gert/Envs/t4/bin/python3
Please specify optimization flags to use during compilation [Default is -march=native]:
Do you wish to use jemalloc as the malloc implementation? (Linux only) [Y/n] n
jemalloc disabled on Linux
Do you wish to build TensorFlow with Google Cloud Platform support? [y/N] n
No Google Cloud Platform support will be enabled for TensorFlow
Do you wish to build TensorFlow with Hadoop File System support? [y/N] n
No Hadoop File System support will be enabled for TensorFlow
Do you wish to build TensorFlow with the XLA just-in-time compiler (experimental)? [y/N] n
No XLA JIT support will be enabled for TensorFlow
Found possible Python library paths:
/Users/Gert/Envs/t4/lib/python3.4/site-packages
Please input the desired Python library path to use. Default is [/Users/Gert/Envs/t4/lib/python3.4/site-packages]
Using python library path: /Users/Gert/Envs/t4/lib/python3.4/site-packages
Do you wish to build TensorFlow with OpenCL support? [y/N] n
No OpenCL support will be enabled for TensorFlow
Do you wish to build TensorFlow with CUDA support? [y/N] n
No CUDA support will be enabled for TensorFlow
Configuration finished
Thanks,
Gert

The MNIST example spends most of its time in the matrix product.
Typical CNNs, on the other hand, spend most of their time in the convolutions.
On the CPU, TF uses Eigen for its matrix products, which is quite well optimized as far as I understand, and that is why you see a noticeable speed-up there.
Convolutions on the CPU are not as well optimized, if my information is current: much of their time is spent copying data into a layout that can be processed by a matrix multiplication, so speeding up the multiplication itself has less of an impact.
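That copy step is the classic "im2col" lowering: patches of the input are unrolled into columns so the whole convolution becomes a single matrix product. A minimal NumPy sketch of the idea (illustrative only; these function names are mine, and TF's actual CPU kernels live in Eigen):

```python
import numpy as np

def im2col(x, k):
    """Unroll every k x k patch of a 2-D input into one column (no padding, stride 1)."""
    h, w = x.shape
    out_h, out_w = h - k + 1, w - k + 1
    cols = np.empty((k * k, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols

def conv2d_via_matmul(x, kernel):
    """'Valid' 2-D correlation expressed as one matrix product over the im2col buffer."""
    k = kernel.shape[0]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    return (kernel.ravel() @ im2col(x, k)).reshape(out_h, out_w)

def conv2d_direct(x, kernel):
    """Reference: naive nested-loop correlation, for checking the result."""
    k = kernel.shape[0]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * kernel)
    return out
```

The `im2col` buffer is where the copying cost goes: the matmul itself runs at Eigen speed, but building `cols` is memory-bound, so an AVX/FMA build helps it much less.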

Related

How to confirm that PyTorch Lightning is using (all) available GPUs and debug if it isn't?

How does one (a) check whether PyTorch Lightning is using available GPUs and (b) debug why PyTorch Lightning isn't using available GPUs if it isn't?
For the (a) monitoring, you can use the tool Glances, and you will see whether all your GPUs are used (for GPU support, install it as pip install glances[gpu]). To debug the resources used (b), first check that your PyTorch installation can reach your GPU at all, for example: python -c "import torch; print(torch.cuda.device_count())" and then all shall be fine...
You can also check whether the GPUs in your machine are being used by running the command:
nvidia-smi
If none, or only some, of the GPUs show activity, it means that Lightning is not using all of them (though the opposite is not always true).
Lightning also usually prints a warning telling you that you are not using all of the GPUs, so check your code's log.
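If you want to check utilization programmatically rather than by eye, you can parse the machine-readable output of `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits` (one integer per GPU, one per line). A small sketch; the helper names and the idle threshold are my own, not part of any library:

```python
def parse_gpu_utilization(nvidia_smi_output):
    """Parse `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`
    output into a list of per-GPU utilization percentages."""
    return [int(line.strip()) for line in nvidia_smi_output.strip().splitlines()]

def idle_gpus(utilizations, threshold=5):
    """Indices of GPUs whose utilization is below `threshold` percent."""
    return [i for i, u in enumerate(utilizations) if u < threshold]
```

Feed it the captured command output (e.g. via `subprocess.check_output`); a non-empty `idle_gpus` result during training is a hint that Lightning is not driving every device.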

OpenMP: Working with Anaconda Python / Cython, but not with System (Arch) Python / Cython

I have a Python/Cython application which is parallelized using OpenMP and which makes several calls to the Intel MKL. Usually, I set the number of threads via OMP_NUM_THREADS=xx. Both the Cython module and MKL (Pardiso solver calls) correctly start several threads when I run my script with an Anaconda distribution (Python 3.6); the CPU load and the number of busy cores can be seen very well in the system monitor.
However, when using the system's Python distribution (Python 3.6 under Arch Linux), only one thread is started, for both the Cython module and the Intel MKL.
At least for my Cython module I can tell that the correct number of threads is requested (via prange()), but only one thread is obtained.
No compilation errors arise, and of course the '-fopenmp' flag is used for compilation. Since the issue affects both my Cython module and the Intel MKL, I assume it is somehow related to my system's OpenMP.
What is the issue here? Thank you!
Try specifying the number of threads inside the code, just before the loop, in case OMP_NUM_THREADS is overwritten somewhere outside. In the Cython module that would look like this (note the cimport; there is no plain-Python openmp module):
cimport openmp
openmp.omp_set_num_threads(num_threads)
# parallel prange() loop here
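Alternatively, you can pin the environment variables from the Python side before the OpenMP runtime and MKL are first initialized. A minimal sketch, assuming the variables are still unset and that they are read when the runtime first loads (so this must run before the first import that pulls in the extension module or MKL; the module name below is hypothetical):

```python
import os

# Thread-count variables are read when the OpenMP runtime / MKL initializes,
# so they must be set before importing anything that loads those libraries.
os.environ.setdefault("OMP_NUM_THREADS", "4")
os.environ.setdefault("MKL_NUM_THREADS", "4")

# import my_cython_module  # hypothetical name; import only after setting the vars
```

setdefault keeps any value already exported in the shell, so an intentional OMP_NUM_THREADS=xx still wins.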

NCCL neccessary to train with multiple GPUs (windows caffe)?

I am using the command-line version of Caffe on Windows to train a network. There are two GPUs (GTX 1080) available in the system. When I train with the CPU only, or specify single-GPU usage with either of the two, the net trains correctly. If the option "gpu all" is given for training, both GPUs are recognized, but I get a "Segmentation fault" before the initialization of the test network finishes, and training does not start.
That is why I think it is a problem with the multi-GPU configuration. I have run some tests building Caffe with the USE_NCCL option enabled and disabled (=1 and =0), but I get the same behaviour in both cases. I have built Caffe from the windows branch.
I have also read on NVIDIA's sites that NCCL is necessary for multi-GPU usage in Caffe, but there are only Linux versions of the NCCL installer. Is it necessary to separately install NCCL on Windows in order to use more than one GPU? I have also read that since the beginning of this year NCCL is integrated into official Caffe, but is it integrated into the windows branch as well, or is a separate install mandatory on Windows? I cannot find a way to install it on Windows 7. Thanks

ATLAS installation from source

I'm installing ATLAS on RHEL 6 with gcc 4.4.2 using
../configure -b 64 -Fa alg -fPIC --cc=/lib/gcc/64-bit/4.4.2/bin/gcc --prefix=/home/pkgs/atlas
I have an 8 GB Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz machine, and it takes close to 5 hours just for "make build". Is this normal? Is there a way to speed up the tune and build process?
From the ATLAS installation guide:
This is the step where ATLAS performs all its empirical tuning, and then uses the discovered kernels to build all required libraries. It uses the BLDdir created by the configure step, and is invoked from the BLDdir with the make build command, or simply by make. This step can be quite long, depending on your platform and whether or not you use architectural defaults. For a system like the Core2Duo with architectural defaults, the build step may take 10 or 20 minutes, while in order to complete a full ATLAS search on a slower platform (eg. MIPS) could take anywhere between a couple of hours and a full day.
So yes, this behaviour is completely normal: ATLAS performs extensive tests to determine the best math kernels for your system.
And yes, there is a way to speed up the build process: use the architectural defaults. Note, however, that this can result in inferior performance of your ATLAS installation.

How to install a bare Linux kernel without any distribution to study it?

I want to study the Linux kernel without any distribution.
I found the LoadLin bootloader for MS-DOS, but I think it only works with older versions of Windows (95, 98, ME).
So I need to install only the kernel on my PC, if possible.
How can I install it?
The kernel alone is not that useful to you; you'll probably need a shell and a working compiler if you want to test things first-hand, and these are not part of the kernel.
There's a distribution called Linux From Scratch which basically allows you to install the kernel and then whatever other software you want, literally from scratch (as in, by compiling everything yourself and adding only what YOU want).
I am wondering, though: what exactly do you want to study, and how does having a distribution affect your studying of the kernel? (Yes, some distributions ship custom kernels, but the major features are almost always the same.)
Minimal Linux Live is a small script that:
downloads the source for the kernel and busybox
compiles them
generates a bootable 8 MB ISO with them
The ISO then leaves you in a minimal shell with busybox.
With QEMU you can then easily boot into the system, which might be a more convenient way to study the kernel.
Or you can just use the Live ISO as a regular distribution and install it on metal.
Usage:
git clone https://github.com/ivandavidov/minimal
cd minimal/src
./build_minimal_linux_live.sh
# Wait.
# Install QEMU.
# minimal_linux_live.iso was generated
./qemu64.sh
and you will be left inside a QEMU window with your new minimal system. Awesome.
See also:
https://unix.stackexchange.com/questions/17122/is-it-possible-to-install-the-linux-kernel-alone
https://superuser.com/questions/307087/linux-distro-with-just-busybox-and-bash
Why not use a distribution? Just get some free VM (e.g. VirtualBox) and install an arbitrary Linux distribution. You have all the build tools you need there to compile the kernel, without actually touching your system.