I've been struggling for a whole week to get parallel HDF5 to work on our cluster, without any progress. I'd be grateful if anyone could help me with this. Thanks!
I'm building parallel HDF5 (hdf5-1.8.15-patch1) on a Lustre file system under RedHat Enterprise Linux 5.5 x86_64. I compiled it with both impi 4.0.2 and openmpi 1.8, and both builds succeeded without any errors. When I run "make check", both builds pass the serial tests but hang immediately after entering the parallel tests (t_mpi, in particular). Eventually, I had to press Ctrl+C to end it. Here is the output:
lijm@c01b03:~/yuan/hdf5-1.8.15-patch1/testpar$ make check
CC t_mpi.o
t_mpi.c: In function ‘test_mpio_gb_file’:
t_mpi.c:284: warning: passing argument 1 of ‘malloc’ with different width due to prototype
t_mpi.c:284: warning: request for implicit conversion from ‘void *’ to ‘char *’ not permitted in C++
t_mpi.c: In function ‘test_mpio_1wMr’:
t_mpi.c:465: warning: passing argument 2 of ‘gethostname’ with different width due to prototype
t_mpi.c: In function ‘test_mpio_derived_dtype’:
t_mpi.c:682: warning: declaration of ‘nerrors’ shadows a global declaration
t_mpi.c:37: warning: shadowed declaration is here
t_mpi.c:771: warning: passing argument 5 of ‘MPI_File_set_view’ discards qualifiers from pointer target type
t_mpi.c:798: warning: passing argument 2 of ‘MPI_File_set_view’ with different width due to prototype
t_mpi.c:798: warning: passing argument 5 of ‘MPI_File_set_view’ discards qualifiers from pointer target type
t_mpi.c:685: warning: unused variable ‘etypenew’
t_mpi.c:682: warning: unused variable ‘nerrors’
t_mpi.c: In function ‘main’:
t_mpi.c:1104: warning: too many arguments for format
t_mpi.c: In function ‘test_mpio_special_collective’:
t_mpi.c:991: warning: will never be executed
t_mpi.c:992: warning: will never be executed
t_mpi.c:995: warning: will never be executed
t_mpi.c: In function ‘test_mpio_gb_file’:
t_mpi.c:229: warning: will never be executed
t_mpi.c:232: warning: will never be executed
t_mpi.c:237: warning: will never be executed
t_mpi.c:238: warning: will never be executed
t_mpi.c:253: warning: will never be executed
t_mpi.c:258: warning: will never be executed
t_mpi.c:259: warning: will never be executed
t_mpi.c:281: warning: will never be executed
t_mpi.c:246: warning: will never be executed
t_mpi.c:267: warning: will never be executed
t_mpi.c:319: warning: will never be executed
t_mpi.c:343: warning: will never be executed
t_mpi.c:385: warning: will never be executed
t_mpi.c:389: warning: will never be executed
t_mpi.c:248: warning: will never be executed
t_mpi.c:269: warning: will never be executed
t_mpi.c: In function ‘main’:
t_mpi.c:1143: warning: will never be executed
t_mpi.c:88: warning: will never be executed
t_mpi.c:102: warning: will never be executed
t_mpi.c:133: warning: will never be executed
t_mpi.c:142: warning: will never be executed
CCLD t_mpi
make t_mpi testphdf5 t_cache t_pflush1 t_pflush2 t_pshutdown t_prestart t_shapesame
make[1]: Entering directory `/home/lijm/yuan/hdf5-1.8.15-patch1/testpar'
make[1]: `t_mpi' is up to date.
make[1]: `testphdf5' is up to date.
make[1]: `t_cache' is up to date.
make[1]: `t_pflush1' is up to date.
make[1]: `t_pflush2' is up to date.
make[1]: `t_pshutdown' is up to date.
make[1]: `t_prestart' is up to date.
make[1]: `t_shapesame' is up to date.
make[1]: Leaving directory `/home/lijm/yuan/hdf5-1.8.15-patch1/testpar'
make check-TESTS
make[1]: Entering directory `/home/lijm/yuan/hdf5-1.8.15-patch1/testpar'
make[2]: Entering directory `/home/lijm/yuan/hdf5-1.8.15-patch1/testpar'
make[3]: Entering directory `/home/lijm/yuan/hdf5-1.8.15-patch1/testpar'
make[3]: Nothing to be done for `_exec_check-s'.
make[3]: Leaving directory `/home/lijm/yuan/hdf5-1.8.15-patch1/testpar'
make[2]: Leaving directory `/home/lijm/yuan/hdf5-1.8.15-patch1/testpar'
make[2]: Entering directory `/home/lijm/yuan/hdf5-1.8.15-patch1/testpar'
===Parallel tests in testpar begin Thu Jun 11 22:07:48 CST 2015===
**** Hint ****
Parallel test files reside in the current directory by default.
Set HDF5_PARAPREFIX to use another directory. E.g.,
HDF5_PARAPREFIX=/PFS/user/me
export HDF5_PARAPREFIX
make check
**** end of Hint ****
make[3]: Entering directory `/home/lijm/yuan/hdf5-1.8.15-patch1/testpar'
============================
Testing t_mpi
============================
t_mpi Test Log
============================
===================================
MPI functionality tests
===================================
Proc 1: hostname=c01b03
Proc 2: hostname=c01b03
Proc 3: hostname=c01b03
Proc 5: hostname=c01b03
--------------------------------
Proc 0: *** MPIO 1 write Many read test...
--------------------------------
Proc 0: hostname=c01b03
Proc 4: hostname=c01b03
Command exited with non-zero status 255
0.08user 0.01system 0:37.65elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+5987minor)pagefaults 0swaps
make[3]: *** [t_mpi.chkexe_] Error 1
make[2]: *** [build-check-p] Interrupt
make[1]: *** [test] Interrupt
make: *** [check-am] Interrupt
The output is the same for both MPI implementations, except that openmpi also prints this warning:
WARNING: It appears that your OpenFabrics subsystem is configured to only allow registering part of your physical memory. This can cause MPI jobs to run with erratic performance, hang, and/or crash.
I've searched for this warning, but I don't think it is the cause of the hang; my reason is given at the end.
I've tried to locate the place where it hangs. What I found is that it always gets stuck on the first collective function it meets. For example, in t_mpi it first hangs at:
MPI_File_delete(filename, MPI_INFO_NULL); (line 477),
in test_mpio_1wMr. If I comment out this line, it gets stuck at the MPI_File_open call just below. I'm not sure what happens inside these functions, though.
There is another thing I noticed. The HDF5 folder where I run "make" is on an NFS file system, and I can only access Lustre through a particular folder located somewhere else. I found that the test runs fine if I don't set HDF5_PARAPREFIX to my Lustre folder, since the test files then stay in the current directory by default. So I suppose this is an issue related to Lustre itself, not to the memory limit?
Thank you!
It's hard to say what's going on here.
It may be that the "generic Unix file system" code path is being applied to your Lustre file system. Intel MPI requires two environment variables (I_MPI_EXTRA_FILESYSTEM and I_MPI_EXTRA_FILESYSTEM_LIST) to use Lustre-optimized code paths (see https://press3.mcs.anl.gov/romio/2014/06/12/romio-and-intel-mpi/ for more details).
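If you want to try this with Intel MPI, the setup might look like the sketch below. The values "on" and "lustre" are the ones Intel's documentation describes for these two variables; the Lustre path is a placeholder, not something from your post, so replace it with your own directory.

```shell
# Ask Intel MPI to use its native Lustre I/O path instead of the generic one.
export I_MPI_EXTRA_FILESYSTEM=on
export I_MPI_EXTRA_FILESYSTEM_LIST=lustre

# Placeholder path (an assumption): point the parallel tests at your Lustre folder.
export HDF5_PARAPREFIX=/path/to/your/lustre/dir

# Then, from the testpar directory, re-run the tests:
#   make check
```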
You'll have to explicitly request lustre support when you build OpenMPI, too.
It would help a lot if you could attach a debugger to one or more of the stuck processes to see where it's hanging. Is it stuck in an I/O routine, or in communication?
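One way to collect that information, assuming gdb and pgrep are available on the node and the hung binary is named t_mpi, is sketched below. The helper name dump_stacks is made up for this example. It attaches to each matching process, prints a backtrace of every thread, and detaches again.

```shell
# Hypothetical helper: print backtraces for every running process with the given name.
dump_stacks() {
    name=${1:-t_mpi}
    for pid in $(pgrep -x "$name"); do
        echo "=== $name pid $pid ==="
        # -batch makes gdb run the given command and then exit, detaching from the process.
        gdb -p "$pid" -batch -ex "thread apply all bt" 2>&1
    done
}

# Run this on the node where the test is hanging:
#   dump_stacks t_mpi
```

If the top frames are inside MPI_File_open or file system calls, that points to I/O; if they are inside barrier or broadcast routines, that points to communication.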
Here are the last few lines from the output of running "make install" at the root level, /home/gm/TEST/:
make[3]: Leaving directory `/home/gm/TEST/tppf/tm/ipmgt'
ld ipfac.o ipfacV.o ipfac_rset.o ipfac_args.o ipfac_d2a.o ipfac_a2d.o ipfac_modr.o ipfac_mod.o ipfac_read.o ipfac_add.o ipfac_del.o ipfac_list.o ipfac_unlk.o ipfac_lock.o ipfac_util.o ipfac_lkid.o -r -o /home/gm/TEST/tppf/lib/ipfac_tppf.o
make[3]: Leaving directory `/home/gm/TEST/tppf/tm/ipfac'
make[2]: Leaving directory `/home/gm/TEST/tppf/tm'
make[1]: *** [i_tm] Error 2
make[1]: Leaving directory `/home/gm/TEST/tppf'
make: *** [i_tppf] Error 2
And the Makefile under /home/gm/TEST/tppf/tm/ipfac contains this rule:
install: ipfac.h $(TPPLIB)/ipfac_tppf.o
$(TPPLIB)/ipfac_tppf.o: $(PROPOBJS)
ld $(PROPOBJS) -r -o $(TPPLIB)/ipfac_tppf.o
Is there something wrong with the linking process? Make should've told me what the error actually is, but it didn't.
BTW, I think /home/gm/TEST/tppf/lib/ipfac_tppf.o was linked and created successfully, or at least it was there in directory /home/gm/TEST/tppf/lib/ after make failed and exited.
That line is not the error line. You can tell that building the target /home/gm/TEST/tppf/lib/ipfac_tppf.o succeeded because there is no error message there.
The error is here:
make[1]: *** [i_tm] Error 2
The [1] means that it was the first level of makefile (note the recipe you are quoting here was in the 3rd level of makefile) and the [i_tm] means that the build of the target i_tm failed. You need to look back up further in the output of make, earlier than what you've shown us, and find the *** error line for building the i_tm target and see what errors were generated there.
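If the output is long, one way to find that line is to save the whole build log and search it for make's "***" markers. A small sketch, where the function name and the log file name are arbitrary choices of mine:

```shell
# List every make failure marker ("*** [target] Error N") in a saved log,
# prefixed with its line number. The first hit is usually the failure
# that set off all the later ones.
make_failures() {
    grep -n -F '***' "$1"
}

# Typical use:
#   make install 2>&1 | tee build.log
#   make_failures build.log | head -n 1
```

The lines just above the first hit in the log are where the failing command's own error message should be.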
I want to run some Torch code that requires the gnuplot library; however, I get the following error:
/Users/mattsmith/torch/install/bin/luajit: ...attsmith/torch/install/share/lua/5.1/gnuplot/gnuplot.lua:127: Gnuplot terminal is not set
stack traceback:
[C]: in function 'error'
...attsmith/torch/install/share/lua/5.1/gnuplot/gnuplot.lua:127: in function 'getfigure'
...attsmith/torch/install/share/lua/5.1/gnuplot/gnuplot.lua:808: in function 'figure'
...attsmith/torch/install/share/lua/5.1/gnuplot/gnuplot.lua:288: in function 'getCurrentPlot'
...attsmith/torch/install/share/lua/5.1/gnuplot/gnuplot.lua:308: in function 'writeToCurrent'
...attsmith/torch/install/share/lua/5.1/gnuplot/gnuplot.lua:836: in function 'gnulplot'
...attsmith/torch/install/share/lua/5.1/gnuplot/gnuplot.lua:976: in function 'plot'
practical3.lua:217: in main chunk
[C]: in function 'dofile'
...mith/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x0104467190
I read in the question Lua Error: "Gnuplot terminal is not set" that I need the gnuplot executable installed. So I downloaded it and followed the instructions on the website, http://gnuplot.sourceforge.net/ReleaseNotes_5_0.html, and then I got this error:
In file included from ./term.h:414:
../term/lua.trm:113:10: fatal error: 'lua.h' file not found
#include <lua.h>
^
1 error generated.
make[3]: *** [term.o] Error 1
make[2]: *** [install-recursive] Error 1
make[1]: *** [install] Error 2
make: *** [install-recursive] Error 1
Not too sure if I am going about this the correct way. Any help would be greatly appreciated!
Thanks
Find wherever lua.h is located. For me it was in: /usr/local/Cellar/lua/5.3.5_1/include/lua
Then open lua.trm (term/lua.trm in the gnuplot sources), find #include <lua.h> and replace lua.h with its full path. In my case: /usr/local/Cellar/lua/5.3.5_1/include/lua/lua.h
You will notice that there are other includes that throw the same error as the one for lua.h. They are all in the same directory, so modify those includes in the same manner.
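To locate the header without guessing, something like the following could help. It is only a sketch: the function name is made up, and the directories worth searching depend on how Lua was installed on your machine.

```shell
# Hypothetical helper: list every lua.h found under the given directories.
find_lua_h() {
    find "$@" -type f -name lua.h 2>/dev/null
}

# Example (the search roots are guesses; use wherever Lua might be installed):
#   find_lua_h /usr/local /opt /usr/include
```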
From the docs:
To avoid this you can use the --output-sync (-O) option. This
option instructs make to save the output from the commands it invokes
and print it all once the commands are completed. Additionally, if
there are multiple recursive make invocations running in parallel,
they will communicate so that only one of them is generating output at a
time.
So, given a makefile:
# A "recipe" that will always fail.
all::
	@foo bar baz
And running, we get:
# A "non-synchronized" run
$ make
make: foo: Command not found
makefile:3: recipe for target 'all' failed
make: *** [all] Error 127
# Synchronize!
$ make --output-sync
makefile:3: recipe for target 'all' failed
make: *** [all] Error 127
Can you see a difference between the 2 runs?
Well, they both fail!
But in the first run, Make lets us know why it failed, because it lets through the error(s) from the failing recipe:
make: foo: Command not found
But for the second run, all we get is:
makefile:3: recipe for target 'all' failed
make: *** [all] Error 127
But why? Why did it fail? Were there any error messages from the recipe explaining the failure? Sure there were, as the first run shows! So why is Make so quick to hide them?
Now that Make is hiding the error message, all we can do here is guess!
We are lucky here, because we know that "foo..." is an invalid command, so presumably, somewhere behind the curtains, the command was rejected for this very reason.
But consider this:
Imagine you have a typo in a command.
Now, imagine a typo in a complex makefile!
Now, imagine the typo in a complex makefile run recursively!
Now, imagine all that in a makefile that runs for a long time and produces a considerable amount of output!
How, then, could it still be justified for Make to "hide" some error messages and show others?
(Version note: this applies to all versions that support output synchronization, i.e. 4.0 and up.)
I opened a bug report on this the other day: https://savannah.gnu.org/bugs/index.php?47365
It should be fixed now: http://git.savannah.gnu.org/cgit/make.git/commit/?id=14b2d7effb0afd75dfd1ed2534e331784f7d2977
I guess you can build the latest version from source or wait until they make another official release. I'll be building from source as I need this fix ASAP :)
When I try to run "make all" on a makefile with some complexity, I get these errors:
C:\BITCLOUD\BitCloud_PS_SAM3S_EK_1_10_0\BitCloud_PS_SAM3S_EK_1_10_0\Applications\ZAppSi\Demo\SEDevice>make all
make -C makefiles/PC -f Makefile_PC_Gcc all APP_NAME=DemoSE
make[1]: Entering directory
'C:/BITCLOUD/BitCloud_PS_SAM3S_EK_1_10_0/BitCloud_PS_SAM3S_EK_1_10_0/Applications/ZAppSi/Demo/SEDevice/makefiles/PC'
A sintaxe do comando está incorrecta.
make[1]: *** [directories] Error 1
make[1]: Leaving directory
'C:/BITCLOUD/BitCloud_PS_SAM3S_EK_1_10_0/BitCloud_PS_SAM3S_EK_1_10_0/Applications/ZAppSi/Demo/SEDevice/makefiles/PC'
make: *** [all] Error 2
where the line
A sintaxe do comando está incorrecta.
translated to English, means: "The syntax of the command is incorrect."
I already tried moving the project to different directories, checking for spaces in file names, and using both GNU make and MinGW make (mingw32-make); the result is the same with both. I also checked all the files that are included in the makefile, and they correspond to what the makefile expects.
I'm not an expert in makefiles, so I'm asking for help.
What is the main problem that occurs when make throws this type of error?
It is most likely not make itself that produces this error. Rather, a command executed by make returned a nonzero exit status, in this case status 1 (hence "Error 1"), and the top-level make then stops with Error 2. Note that by default make stops as soon as a command fails.
Since the output doesn't show what command was executed, there is no way to tell what went wrong exactly.
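To illustrate what "nonzero exit status" means, here is a small sketch in a POSIX shell (the helper name is made up). On Windows the same number is available in %ERRORLEVEL% after a command finishes.

```shell
# Run a command and report the exit status that make would see.
report_status() {
    status=0
    "$@" || status=$?
    echo "exit status: $status"
}

report_status true    # a command that succeeds
report_status false   # a command that fails; make would stop at this point
```

Make does the same check after each recipe line: status 0 means carry on, anything else is reported as "Error N" and stops the build.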
EDIT: from the GNU make manual:
-d Print debugging information in addition to normal processing.
The debugging information says which files are being considered
for remaking, which file-times are being compared and with what
results, which files actually need to be remade, which implicit
rules are considered and which are applied---everything
interesting about how make decides what to do.
--debug[=FLAGS]
Print debugging information in addition to normal processing.
If the FLAGS are omitted, then the behavior is the same as if -d
was specified. FLAGS may be a for all debugging output (same as
using -d), b for basic debugging, v for more verbose basic
debugging, i for showing implicit rules, j for details on
invocation of commands, and m for debugging while remaking
makefiles.
I suggest running make --debug=j to see the commands.
I have the following directory structure for the GStreamer sources:
/home/dev/cerbero/sources/linux_x86_64/gst-plugins-bad-0.10.23
When I run ./autogen.sh it runs fine, but when I run "make" it gives the following error:
Making all in gst
make[2]: Entering directory `/home/dev/cerbero/sources/linux_x86_64/gst-plugins-bad-0.10.23/gst'
make -C adpcmdec
make[3]: Entering directory `/home/dev/cerbero/sources/linux_x86_64/gst-plugins-bad-0.10.23/gst/adpcmdec'
CC libgstadpcmdec_la-adpcmdec.lo
adpcmdec.c:586:21: error: expected declaration specifiers or '...' before '(' token
adpcmdec.c:586:40: error: expected declaration specifiers or '...' before '(' token
adpcmdec.c:586:59: error: unknown type name 'adpcmdec'
adpcmdec.c:587:5: error: expected declaration specifiers or '...' before string constant
adpcmdec.c:587:22: error: expected declaration specifiers or '...' before 'plugin_init'
adpcmdec.c:587:35: error: expected declaration specifiers or '...' before string constant
adpcmdec.c:587:44: error: expected declaration specifiers or '...' before string constant
adpcmdec.c:587:52: error: expected declaration specifiers or '...' before string constant
adpcmdec.c:588:5: error: expected declaration specifiers or '...' before string constant
adpcmdec.c:576:1: warning: 'plugin_init' defined but not used [-Wunused-function]
make[3]: *** [libgstadpcmdec_la-adpcmdec.lo] Error 1
make[3]: Leaving directory `/home/dev/cerbero/sources/linux_x86_64/gst-plugins-bad-0.10.23/gst/adpcmdec'
make[2]: *** [adpcmdec] Error 2
make[2]: Leaving directory `/home/dev/cerbero/sources/linux_x86_64/gst-plugins-bad-0.10.23/gst'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/dev/cerbero/sources/linux_x86_64/gst-plugins-bad-0.10.23'
make: *** [all] Error 2
A few things to check before building:
Did you install liboil and the orc compiler?
Run ./configure --enable-orc if you did.
Run ./configure if you did not.
Then run make.
Remember to run make distclean before the steps above, or simply start from a clean source tree. Do not do this directly in the "dirty", half-built folder.
It could also be a nasm/yasm issue, though I doubt it.
EDIT: My suggestion is to install orc, because that speeds up GStreamer a lot!