I'm installing ATLAS on RHEL 6 with gcc 4.4.2 using
../configure -b 64 -Fa alg -fPIC --cc=/lib/gcc/64-bit/4.4.2/bin/gcc --prefix=/home/pkgs/atlas
I have an 8 GB Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz machine and it takes close to 5 hours just for "make build". Is this normal? Is there a way to speed up the tune and build process?
From the ATLAS installation guide:
This is the step where ATLAS performs all its empirical tuning, and then uses the discovered kernels to build all required libraries. It uses the BLDdir created by the configure step, and is invoked from the BLDdir with the make build command, or simply by make. This step can be quite long, depending on your platform and whether or not you use architectural defaults. For a system like the Core2Duo with architectural defaults, the build step may take 10 or 20 minutes, while in order to complete a full ATLAS search on a slower platform (eg. MIPS) could take anywhere between a couple of hours and a full day.
So yes, this behaviour is totally normal, because ATLAS performs extensive tests to determine the best math kernels for your system.
And yes, there is a way to speed up the build process: use the architectural defaults. Note, however, that this could result in inferior performance of your ATLAS installation.
I have an embedded Linux (OpenWrt) project for custom hardware. Any change in the kernel or in an application requires recompiling the full image or the application, and recompiling is painfully slow.
To reduce this pain I bought an AMD Threadripper 3970X based workstation with 128 GB RAM and a 1 TB SSD. Benchmarks for this CPU show a Linux kernel compilation time of 120 seconds.
But my compilation times are longer than that.
Full image compilation first time reduced from:
to:
Repeated image compilation reduced from:
to:
Package recompilation ($ time make package/tensorflow/compile) reduced from:
to:
I.e. compile time was reduced 2-7x.
During the first image compilation all necessary source code has to be downloaded from the network. I have a fast Ethernet (100 Mb/s) connection so as not to waste time on that.
I use a RAM disk:
$ sudo mkdir /mnt/ramdisk
$ sudo mount -t tmpfs -o rw,size=64G tmpfs /mnt/ramdisk
to store all sources, object and temporary files, so I believe there are no I/O losses.
I use make -j64 to compile. I see that all 64 cores are loaded only very rarely during compilation:
Mostly I see the following:
or even this:
so I can't believe that faster compilation can't be achieved. Could someone give me hints or advice on how to speed up the GCC C/C++ cross-compilation process? Some searching points me to distcc and Parallel GCC, but I don't have experience with them, so I'm not sure whether they are what I need, as OpenWrt has almost nothing in its manuals explaining how to speed up the build process.
In Linux there is the concept of an incremental build: the first build takes time, but on later builds only the parts that were changed or added need to be rebuilt. There is no need to rebuild everything, so those builds are faster.
Not all CPU cores will be loaded all the time. It depends on how many tasks are ready to run at that moment. Suppose your system has 8 cores but only 6 tasks are running; in that case not all cores will be fully loaded.
When trying to build only the boost_log library for an RPI3, the builder runs out of memory.
I use:
./b2 --with-log
And the help text for the builder states:
--with-<library> Build and install the specified <library>. If this
option is used, only libraries specified using this
option will be built.
After quite some time building I see:
virtual memory exhausted: Cannot allocate memory
Do I have any options aside from trying to cross-compile on a larger system? (The RPI3 has 1 GB RAM and a small 100 MB swap partition.)
You really only have two options I can think of considering your Pi's physical constraints: 1) figure out whether you could attach an external device (SSD, flash drive, etc.) and have the system swap to that, or 2) set up a cross-compilation environment on your more powerful rig.
I would personally recommend #2 as it's going to be faster and more flexible. The internet is full of guides on how to cross-compile for the Pi on a multitude of hosts.
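For completeness, option 1 (swapping to an external device) could look roughly like the sketch below. It is only an illustration under assumptions: the mount point /mnt/usb, the 2048 MB size, and the helper name run are all hypothetical, and the drive is assumed to carry a filesystem that supports swap files (for example ext4). By default the script only prints the commands; set DRY_RUN=0 to actually execute them with sudo.

```shell
#!/bin/sh
# Sketch: add a swap file on an external drive (all paths and sizes are assumptions).
SWAPFILE="${SWAPFILE:-/mnt/usb/swapfile}"   # hypothetical mount point of the drive
SIZE_MB="${SIZE_MB:-2048}"                  # hypothetical swap size in megabytes

# Hypothetical helper: print the command in dry-run mode, otherwise run it via sudo.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        sudo "$@"
    fi
}

run dd if=/dev/zero of="$SWAPFILE" bs=1M count="$SIZE_MB"   # allocate the file
run chmod 600 "$SWAPFILE"                                   # swap must not be world-readable
run mkswap "$SWAPFILE"                                      # write the swap signature
run swapon "$SWAPFILE"                                      # enable it until the next reboot
```

Swapping to flash or USB storage is slow, so the build may still take a long time even if it no longer runs out of memory.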
I am using the command-line version of caffe on Windows to train a network. There are two GPUs (GTX 1080) available in the system. When I train with the CPU only, or specify a single GPU (either of the two), the net trains correctly. If the option "gpu all" is given for training, both GPUs are recognized correctly, but I get a "Segmentation fault" before the initialization of the test network finishes, and training does not start.
That's why I think it is a problem with the multi-GPU configuration. I have done some tests building caffe with the USE_NCCL option enabled and disabled (=1 and =0), but I get the same behaviour in both cases. I built caffe from the windows branch.
I have also read on NVIDIA sites that NCCL is necessary in caffe for multi-GPU usage, but there are only Linux versions of the NCCL installer. Is it necessary to install NCCL separately on Windows in order to use more than one GPU? I have also read that since the beginning of this year NCCL is integrated in the official caffe, but is it integrated in the windows branch as well, or is a separate installation on Windows mandatory? I cannot find a way to install it on Windows 7. Thanks
I've been running Tensorflow on my lovely MBP early 2015, CPU only.
I decided to build a Tensorflow version with Bazel to speed things up with: SSE4.1, SSE4.2, AVX, AVX2 and FMA.
bazel build --copt=-march=native //tensorflow/tools/pip_package:build_pip_package
But retraining the Inception v3 model with the new install isn't faster, it uses exactly the same amount of time.
It is strange, because while doing inference with a trained inception model I get a 12% speed increase. Training the MNIST example is 30% faster.
So is it possible that we don't get any speed benefits doing retraining?
I also did a Bazel build for a retrainer as explained here, with the same result.
My ./configure:
Please specify the location of python. [Default is /Users/Gert/Envs/t4/bin/python]: Users/Gert/Envs/t4/bin/python3
Invalid python path. Users/Gert/Envs/t4/bin/python3 cannot be found
Please specify the location of python. [Default is /Users/Gert/Envs/t4/bin/python]: ls
Invalid python path. ls cannot be found
Please specify the location of python. [Default is /Users/Gert/Envs/t4/bin/python]: lslss
Invalid python path. lslss cannot be found
Please specify the location of python. [Default is /Users/Gert/Envs/t4/bin/python]: /rt/Envs/t4/bin/python3^C
(t4) Gerts-MacBook-Pro:tensorflow root#
(t4) Gerts-MacBook-Pro:tensorflow root# ./configure
Please specify the location of python. [Default is /Users/Gert/Envs/t4/bin/python]: /Users/Gert/Envs/t4/bin/python3
Please specify optimization flags to use during compilation [Default is -march=native]:
Do you wish to use jemalloc as the malloc implementation? (Linux only) [Y/n] n
jemalloc disabled on Linux
Do you wish to build TensorFlow with Google Cloud Platform support? [y/N] n
No Google Cloud Platform support will be enabled for TensorFlow
Do you wish to build TensorFlow with Hadoop File System support? [y/N] n
No Hadoop File System support will be enabled for TensorFlow
Do you wish to build TensorFlow with the XLA just-in-time compiler (experimental)? [y/N] n
No XLA JIT support will be enabled for TensorFlow
Found possible Python library paths:
/Users/Gert/Envs/t4/lib/python3.4/site-packages
Please input the desired Python library path to use. Default is [/Users/Gert/Envs/t4/lib/python3.4/site-packages]
Using python library path: /Users/Gert/Envs/t4/lib/python3.4/site-packages
Do you wish to build TensorFlow with OpenCL support? [y/N] n
No OpenCL support will be enabled for TensorFlow
Do you wish to build TensorFlow with CUDA support? [y/N] n
No CUDA support will be enabled for TensorFlow
Configuration finished
Thanks,
Gert
The MNIST example spends most of its time inside the matrix product.
On the other hand, typical CNNs spend most of their time inside the convolutions.
TF uses Eigen for its matrix products on the CPU, which is quite optimized, as I understand it, and that is the reason why you see a noticeable speed-up.
Convolutions on the CPU are not as optimized, if my info is current. They spend much of their time copying data so that it can be processed by matrix multiplication. So there is less of an impact when the latter is sped up.
I am using Linux 3.18.25 on a second-generation i5 machine (dual boot, Windows and Linux). I am making some changes in kernel modules to get an idea of the kernel code. The problem is that every time I compile my code using the make command it takes approximately 1 hour and 30 minutes; even if I use make -j 4 it takes almost the same time. What should I do to compile the kernel code more quickly? Is there any other way to compile the kernel than using make or make -j 4?
Well, I am not an expert, but from my experience:
Set the -j parameter to the number of your processors; if you have 8, make it 8. You can check with 'cat /proc/cpuinfo'.
If it's a virtual machine, make sure you have hyper enabled and that the virtual machine is using more than one physical CPU core.
Don't use a cross-toolchain; try to compile on the target architecture itself (i.e. if it's amd64, compile on an amd64 machine).
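The first point can be sketched as follows, assuming GNU coreutils' nproc is available (getconf is used as a fallback); the variable name jobs is only illustrative. The script just prints the make invocation so it can be tried safely outside a kernel tree.

```shell
#!/bin/sh
# Sketch: derive the make job count from the number of online CPUs.
# "jobs" is a hypothetical variable name used only in this example.
jobs="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)"

echo "would run: make -j${jobs}"
# Inside the kernel source tree, the actual build would be:
# make -j"${jobs}"
```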
EDIT:
(Update from Andy's comment) Check out ccache and how it's used in kernel compilation: http://linuxdeveloper.blogspot.de/2012/05/using-ccache-to-speed-up-kernel.html
Additional note: also make sure you get the most out of your CPU: https://askubuntu.com/questions/523640/how-i-can-disable-cpu-frequency-scaling-and-set-the-system-to-performance
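As a rough illustration of the ccache suggestion (a sketch, not taken from the linked post): the compiler passed to the kernel build is wrapped in ccache when it is installed. The variable name kernel_cc is hypothetical, and -j 4 is simply the value from the question. The script only prints the command it would run.

```shell
#!/bin/sh
# Sketch: route kernel compiles through ccache when it is installed.
# "kernel_cc" is a hypothetical variable name used only in this example.
if command -v ccache >/dev/null 2>&1; then
    kernel_cc="ccache gcc"
else
    kernel_cc="gcc"   # fall back to the plain compiler
fi

echo "would run: make CC=\"${kernel_cc}\" -j 4"
# Inside the kernel source tree, the actual build would be:
# make CC="${kernel_cc}" -j 4
```

Note that ccache only pays off on later builds of files whose preprocessed source has not changed (for example after a make clean); the first build with a cold cache is not faster.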
It all depends on the machine you are using. For -j4 to help you need at least 4 cores; otherwise the jobs will just wait for each other (this looks exactly like what you describe). Try compiling on a multicore machine instead (I know this is not very helpful, but from my experience compiling kernels there is not much else you can do).
EDIT:
As it turns out, I have lived a very sheltered life so far. Kernel compilation usually takes between 1 and 2 hours, which is exactly what you see.
BUT:
There are still things you can do, and they are all listed here.
Good luck