TensorFlow: Quantize graph not working - terminal

I've been following this tutorial to quantize the graph for iOS: https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/
I run this in the terminal:
bazel build tensorflow/tools/quantization:quantize_graph && \
bazel-bin/tensorflow/tools/quantization/quantize_graph \
--input=stripped_graph.pb \
--input_node_names=Mul \
--output_node_names=final_result \
--output=final_output_graph.pb \
--mode=eightbit
However, all it outputs is the following:
INFO: Found 1 target...
Target //tensorflow/tools/quantization:quantize_graph up-to-date:
bazel-bin/tensorflow/tools/quantization/quantize_graph
INFO: Elapsed time: 0.748s, Critical Path: 0.30s
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Why isn't it completing the command? Does my computer require a GPU?
Update
Running the same command in a docker image outputs the following:
ERROR: /tensorflow/tensorflow/core/kernels/BUILD:1315:1: C++ compilation of rule '//tensorflow/core/kernels:matrix_solve_ls_op' failed: gcc failed: error executing command /usr/bin/gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -Wall -Wl,-z,-relro,-z,now -B/usr/bin -B/usr/bin -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-canonical-system-headers ... (remaining 100 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 4.
gcc: internal compiler error: Killed (program cc1plus)
Update
For anyone who encounters this: just run the quantization command in the terminal without Docker. It may take a while (mine took about an hour), but it should work, and it doesn't require a GPU.

Never mind: the command was working the whole time without printing progress; it simply took about an hour to produce its output.
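As for the Docker failure: "gcc: internal compiler error: Killed" usually means the kernel's OOM killer terminated the compiler because the container ran out of memory. A hedged sketch of one workaround is to lower Bazel's parallelism so each compile job has enough RAM (the right value depends on how much memory the container has; giving the container more memory also works):

```shell
# Sketch: --jobs limits how many compile actions Bazel runs in parallel,
# so each gcc invocation gets a larger share of the container's memory.
bazel build --jobs=2 tensorflow/tools/quantization:quantize_graph
```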

Related

Cross compiling FFTW for ARM Neon

I am trying to compile FFTW3 to run on Arm NEON (more precisely, on a Cortex-A53). The build env is x86_64-pokysdk-linux, the host env is aarch64-poky-linux. I am using the aarch64-poky-linux-gcc compiler.
I used the following command at first:
./configure --prefix=/build_env/neon/neon_install_8 --host=aarch64-poky-linux --enable-shared --enable-single --enable-neon --with-sysroot=/opt/poky/2.5.3/sysroots/aarch64-poky-linux "CC=/opt/poky/2.5.3/sysroots/x86_64-pokysdk-linux/usr/bin/aarch64-poky-linux/aarch64-poky-linux-gcc -march=armv8-a+simd -mcpu=cortex-a53 -mfloat-abi=softfp -mfpu=neon"
The compiler did not support the -mfloat-abi=softfp and -mfpu=neon flags. It also did not let me define the path to the sysroot this way.
I then used the following command:
./configure --prefix=/build_env/neon/neon_install_8 --host=aarch64-poky-linux --enable-shared --enable-single --enable-neon "CC=/opt/poky/2.5.3/sysroots/x86_64-pokysdk-linux/usr/bin/aarch64-poky-linux/aarch64-poky-linux-gcc" "CFLAGS=--sysroot=/opt/poky/2.5.3/sysroots/aarch64-poky-linux -mcpu=cortex-a53 -march=armv8-a+simd"
This command succeeded with this config log and this config.h. I then ran make followed by make install, copied the resulting shared library into my host env, and switched my code base from the fftw_ prefix to fftwf_. The final step was to recompile the program. I ran a test and compared the times for both algorithms using <sys/resource.h>. I also called fftw[f]_forget_wisdom() for both algorithms so that the comparison is fair. However, I am not getting a speedup, even though I would expect a SIMD architecture (NEON in this case) to accelerate the FFTW library.
I would really appreciate it if anyone could point out something I am doing wrong, so that I can fix it and get the performance boost I am looking for.
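One sanity check before timing, sketched under the assumption of FFTW's standard build layout: FFTW records the SIMD sets it was configured with in config.h, and the version string embedded in the built library lists the SIMD support that was actually compiled in (e.g. "fftw-3.3.x-neon"):

```shell
# In the FFTW build directory, after configure: HAVE_NEON should be defined.
grep HAVE_NEON config.h
# The version string embedded in the library lists compiled-in SIMD sets.
strings .libs/libfftw3f.so | grep 'fftw-3'
```

If NEON does not show up in either place, the codelets never made it into the build, and no planner flag will recover the speedup.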

How can I compile CASTEP 18.1 on Cray XC30?

How do I compile the CASTEP 18.1 periodic electronic structure code to run in parallel on a Cray XC30 MPP system?
Full compilation instructions for CASTEP 18.1 on the UK National Supercomputing service, ARCHER (a Cray XC30 system), can be found on GitHub at:
https://github.com/hpc-uk/build-instructions/blob/master/CASTEP/ARCHER_18.1.0_gcc6_CrayMPT.md
In short, load modules:
module swap PrgEnv-cray PrgEnv-gnu
module load fftw/3.3.4.11
Set the following options in Makefile:
COMMS_ARCH := mpi
FFT := fftw3
BUILD := fast
MATHLIBS := mkl10
Note the path to Intel MKL libraries and then build with:
unset CPU
make -j8 CASTEP_ARCH=linux_x86_64_gfortran6.0-XT clean
make -j8 CASTEP_ARCH=linux_x86_64_gfortran6.0-XT
The castep.mpi executable can be found at
obj/linux_x86_64_gfortran6.0-XT/castep.mpi

Build Python 2.7.12 on a Mac with Intel compiler

I've been trying to build Python from source on my Mac with the Intel compiler suite (Intel Parallel Studio) and link it against Intel's MKL.
The reason is that I want to use exactly the same environment for developing Python code on my Mac as on our Linux cluster.
As long as I am not telling the configure script to use Intel's parallel studio, Python builds fine (configure and make: ./configure --with(out)-gcc). But as soon as I include --with-icc, or if I set the appropriate environment variables, mentioned in ./configure --help, to the Intel compilers and linkers, make fails with:
icc -c -fno-strict-aliasing -fp-model strict -g -O2 -DNDEBUG -g -O3 -Wall -Wstrict-prototypes -I. -IInclude -I./Include -DPy_BUILD_CORE -o Python/getcompiler.o Python/getcompiler.c
Python/getcompiler.c(27): error: expected a ";"
return COMPILER;
^
compilation aborted for Python/getcompiler.c (code 2)
make: *** [Python/getcompiler.o] Error 2
I've searched everywhere, but nobody seems to be interested in building Python on a Mac with Intel compilers, or I am the only one who has problems with it. I've also configured my environment according to Intel's instructions: source /opt/intel/bin/compilervars.sh intel64 in ~/.bash_profile.
In any case, my environment is:
OS X 10.11.6
Xcode 8.1 / Build version 8B62
Intel Parallel Studio XE 2017.0.036 (C/C++, Fortran)
Thanks,
François
You could edit the line in getcompiler.c that it is complaining about, e.g. to:
return "[Intel compiler]";
If you wanted to get fancier you could add in the compiler version, using e.g. the __INTEL_COMPILER macro.
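If you prefer to script the edit, something like the following would rewrite the offending return statement (shown here on a scratch copy with a hypothetical file body; in the real source tree you would run the sed command on Python/getcompiler.c directly, and the .bak suffix keeps the original):

```shell
# Demo on a scratch copy of the file (hypothetical contents).
mkdir -p demo/Python
printf 'const char *\nPy_GetCompiler(void)\n{\n\treturn COMPILER;\n}\n' > demo/Python/getcompiler.c
# Replace the macro-based return with a literal string; .bak keeps a backup.
sed -i.bak 's/return COMPILER;/return "[Intel compiler]";/' demo/Python/getcompiler.c
grep 'return' demo/Python/getcompiler.c
```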

Compilation of llvm and clang from their git repos hangs at 96%

I have a problem compiling LLVM and Clang with the BPF and x86 targets on a Debian machine. The GCC version is 6.2, and Python is present on the system. The compilation has already been running for more than 24 hours. Now it hangs at
96% linking cxx executable ../../bin/opt
Should I keep waiting, or is there something I can do about it?
Most probably your linker is running out of RAM. A few suggestions:
Release builds tend to require less RAM for linking than Debug ones
Use gold instead of BFD ld
Add more RAM :)
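The first two suggestions can be combined in the LLVM CMake configuration. LLVM_USE_LINKER and LLVM_PARALLEL_LINK_JOBS are real LLVM CMake cache variables in recent trees (verify against your LLVM version's docs), and restricting the target list further reduces build and link work. A sketch:

```shell
# Sketch for an LLVM git checkout, run from a separate build directory.
# LLVM_PARALLEL_LINK_JOBS=1 serializes the memory-hungry link steps.
cmake -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_TARGETS_TO_BUILD="X86;BPF" \
  -DLLVM_USE_LINKER=gold \
  -DLLVM_PARALLEL_LINK_JOBS=1 \
  ../llvm
ninja
```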

How to speed up Compile Time of my CMake enabled C++ Project?

I came across several SO questions lately regarding specific aspects of improving the turn-around time of CMake enabled C++ projects (like "At what level should I distribute my build process?" or "cmake rebuild_cache for just a subdirectory?"), and I was wondering whether there is more general guidance utilizing the specific possibilities CMake offers. If there is no cross-platform compile-time optimization, I'm mainly interested in Visual Studio or GNU toolchain based approaches.
And I'm already aware of and investing into the generally recommended areas to speed up C++ builds:
Change/Optimize/fine-tune the toolchain
Optimize your code base/software architecture (e.g. by reducing dependencies and using well-defined sub-projects - unit tests)
Invest in better hardware (SSD, CPU, memory)
as recommended here, here or here. So my focus in this question is on the first point.
Plus I know of the recommendations to be found in CMake's Wiki:
CMake: building with all your cores
CMake Performance Tips
The former just covers the basics (parallel make); the latter mostly covers how to speed up parsing CMake files.
Just to make this a little more concrete, if I take my CMake example from here with 100 libraries using MSYS/GNU I got the following time measurement results:
$ cmake --version
cmake version 3.5.2
CMake suite maintained and supported by Kitware (kitware.com/cmake).
$ time -p cmake -G "MSYS Makefiles" ..
-- The CXX compiler identification is GNU 4.8.1
...
-- Configuring done
-- Generating done
-- Build files have been written to: [...]
real 27.03
user 0.01
sys 0.03
$ time -p make -j8
...
[100%] Built target CMakeTest
real 113.11
user 8.82
sys 33.08
So I have a total of ~140 seconds and my goal - for this admittedly very simple example - would be to get this down to about 10-20% of what I get with the standard settings/tools.
Here's what I had good results with using CMake and Visual Studio or GNU toolchains:
Exchange GNU make with Ninja. It's faster, makes use of all available CPU cores automatically, and has good dependency management. Just be aware of:
a.) You need to set up the target dependencies in CMake correctly. If the build reaches a point where an artifact depends on another, it has to wait until the dependencies are compiled (synchronization points).
$ time -p cmake -G "Ninja" ..
-- The CXX compiler identification is GNU 4.8.1
...
real 11.06
user 0.00
sys 0.00
$ time -p ninja
...
[202/202] Linking CXX executable CMakeTest.exe
real 40.31
user 0.01
sys 0.01
b.) Linking is always such a synchronization point. You can make more use of CMake's object libraries to reduce their number, but it makes your CMake code a little uglier.
$ time -p ninja
...
[102/102] Linking CXX executable CMakeTest.exe
real 27.62
user 0.00
sys 0.04
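A minimal sketch of the object-library variant behind those numbers (hypothetical file and target names, written as a shell snippet that generates the CMakeLists.txt): the sources are compiled once into a set of object files that feed straight into the executable, skipping the intermediate archive/link step.

```shell
# Sketch with hypothetical names: an OBJECT library produces object files
# only; $<TARGET_OBJECTS:...> injects them directly into the executable.
cat > CMakeLists.txt <<'EOF'
cmake_minimum_required(VERSION 3.5)
project(ObjLibDemo CXX)
add_library(core_objs OBJECT lib1.cpp lib2.cpp)
add_executable(CMakeTest main.cpp $<TARGET_OBJECTS:core_objs>)
EOF
grep TARGET_OBJECTS CMakeLists.txt
```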
Split less frequently changed or stable code parts into separate CMake projects and use CMake's ExternalProject_Add() or - if you e.g. switch to binary delivery of some libraries - find_library().
Think of a different set of compiler/linker options for your daily work (but only if you also have some test time/experience with the final release build options).
a.) Skip the optimization parts
b.) Try incremental linking
If you often make changes to the CMake code itself, think about rebuilding CMake from source, optimized for your machine's architecture. CMake's officially distributed binaries are just a compromise that works on every possible CPU architecture.
When I use MinGW64/MSYS to rebuild CMake 3.5.2 with e.g.
cmake -DCMAKE_BUILD_TYPE:STRING="Release" \
      -DCMAKE_CXX_FLAGS:STRING="-march=native -m64 -Ofast -flto" \
      -DCMAKE_EXE_LINKER_FLAGS:STRING="-Wl,--allow-multiple-definition" \
      -G "MSYS Makefiles" ..
I can accelerate the first part:
$ time -p [...]/MSYS64/bin/cmake.exe -G "Ninja" ..
real 6.46
user 0.03
sys 0.01
If your file I/O is very slow, then, since CMake works with dedicated binary output directories, make use of a RAM disk. If you are still using a hard drive, consider switching to a solid-state disk.
Depending on your final output file, exchange the GNU standard linker for the gold linker. Even faster than gold is lld from the LLVM project. You have to check whether it already supports the features you need on your platform.
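With GCC or Clang, the linker swap can be requested per project through the compiler driver; a sketch (whether -fuse-ld=gold or -fuse-ld=lld is accepted depends on your toolchain version):

```shell
# Sketch: pass -fuse-ld=gold to the driver for both executable and
# shared-library link steps; substitute lld if your toolchain supports it.
cmake -DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=gold" \
      -DCMAKE_SHARED_LINKER_FLAGS="-fuse-ld=gold" ..
```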
Use Clang/C2 instead of the Visual C++ compiler. For the Visual C++ compiler, performance recommendations are provided by the Visual C++ team; see https://blogs.msdn.microsoft.com/vcblog/2016/10/26/recommendations-to-speed-c-builds-in-visual-studio/
Incredibuild can boost the compilation time.
References
CMake: How to setup Source, Library and CMakeLists.txt dependencies?
Replacing ld with gold - any experience?
Is the lld linker a drop-in replacement for ld and gold?
For speeding up the CMake configure time see: https://github.com/cristianadam/cmake-checks-cache
LLVM + Clang got a ~3x speedup.
