Does HDF5 support concurrent reads, or writes to different files? - anaconda

I'm trying to understand the limits of HDF5 concurrency.
There are two builds of HDF5: parallel HDF5 and the default build. The parallel version is currently supplied in Ubuntu, and the default one in Anaconda (judging by the --enable-parallel flag).
I know that parallel writes to the same file are impossible. However, I don't fully understand to what extent the following actions are possible with the default or the parallel build:
several processes reading from the same file
several processes reading from different files
several processes writing to different files.
Also, are there any reasons Anaconda does not have the --enable-parallel flag on by default? (https://github.com/conda/conda-recipes/blob/master/hdf5/build.sh)

AFAICT, there are three ways to build libhdf5:
with neither thread-safety nor MPI support (as in the conda recipe you posted)
with MPI support but no thread safety
with thread safety but no MPI support
That is, the --enable-threadsafe and --enable-parallel flags are mutually exclusive (https://www.hdfgroup.org/hdf5-quest.html#p5thread).
As for concurrent reads on one or even multiple files, the answer is that you need thread safety (https://www.hdfgroup.org/hdf5-quest.html#tsafe):
Concurrent access to one or more HDF5 file(s) from multiple threads in
the same process will not work with a non-thread-safe build of the
HDF5 library. The pre-built binaries that are available for download
are not thread-safe.
Users are often surprised to learn that (1) concurrent access to
different datasets in a single HDF5 file and (2) concurrent access to
different HDF5 files both require a thread-safe version of the HDF5
library. Although each thread in these examples is accessing different
data, the HDF5 library modifies global data structures that are
independent of a particular HDF5 dataset or HDF5 file. HDF5 relies on
a semaphore around the library API calls in the thread-safe version of
the library to protect the data structure from corruption by
simultaneous manipulation from different threads. Examples of HDF5
library global data structures that must be protected are the
freespace manager and open file lists.
Edit: The links above no longer work because the HDF Group reorganised their website. There is a page Questions about thread-safety and concurrent access in the HDF5 Knowledge Base that contains some useful information.
While only concurrent threads on a single process are mentioned in the passage, it appears to apply equally to forked subprocesses: see this h5py multiprocessing example.
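For illustration, here is a minimal sketch of that pattern, where each worker process opens its own file handle only after it has been spawned; the file and dataset names are made up, and this assumes h5py is installed and the files already exist:

# Hypothetical sketch: each worker opens the HDF5 file *after* the fork/spawn,
# so no libhdf5 state is shared across process boundaries.
# Assumes data_0.h5 ... data_3.h5 exist and each contains a dataset "values".
import multiprocessing as mp

import h5py


def read_mean(path):
    # Open inside the worker process; never pass an open h5py.File
    # (or objects derived from it) across the process boundary.
    with h5py.File(path, "r") as f:
        return float(f["values"][...].mean())


if __name__ == "__main__":
    paths = ["data_{}.h5".format(i) for i in range(4)]
    with mp.Pool(processes=4) as pool:
        print(pool.map(read_mean, paths))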
Now, for parallel access, you might want to use "Parallel HDF5", but those features require MPI. This pattern is supported by h5py, but is more complicated and esoteric, and probably even less portable than thread-safe mode. More importantly, trying to naively do concurrent reads with a parallel build of libhdf5 will lead to unexpected results, because the library isn't thread-safe.
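For completeness, this is roughly what the MPI route looks like from h5py, assuming h5py was built against a parallel (MPI-enabled) libhdf5 and mpi4py is installed; the file and dataset names are placeholders:

# Minimal sketch of the MPI-based "Parallel HDF5" route.
# Run with something like: mpiexec -n 4 python parallel_write.py
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Every rank opens the same file collectively through the MPI-IO driver.
with h5py.File("parallel_demo.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("ranks", (comm.Get_size(),), dtype="i")
    # Each rank writes its own, non-overlapping element of the dataset.
    dset[rank] = rank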
Aside from efficiency, one limitation of the thread-safe build flag is the lack of Windows support (https://www.hdfgroup.org/hdf5-quest.html#gconc):
The thread-safe version of HDF5 is currently not tested or supported
on MS Windows platforms. A user was able to get this working on
Windows 64-bit and contributed his Windows 64-bit Pthreads patches.
Getting weird corrupt results when reading (different!) files from Python is definitely unexpected and frustrating given how concurrent read access is one of the touted "features" of HDF5. Perhaps a better default recipe for conda would be to include --enable-threadsafe on those platforms that support it, but I guess then you would end up with platform-specific behavior. Maybe there ought to be separate packages for the three build modes instead?

Just to add:
I think independent concurrent processes (e.g. separate Python processes) doing read access should be fine
HDF5 1.10 will support Single Writer Multiple Reader (more infos), and h5py 2.5.0 will also have support for it; a sketch of the h5py side follows below
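For reference, here is a minimal sketch of what SWMR usage looks like from h5py, assuming libhdf5 1.10+ and an h5py release with SWMR support; the file and dataset names are invented:

# Writer side (one process): create the file with the latest file format,
# then switch on SWMR mode before readers attach.
import h5py

with h5py.File("swmr_demo.h5", "w", libver="latest") as f:
    dset = f.create_dataset("values", (0,), maxshape=(None,), dtype="f8")
    f.swmr_mode = True
    for i in range(10):
        dset.resize((i + 1,))
        dset[i] = float(i)
        dset.flush()  # make the new element visible to readers

# Reader side (other processes, running concurrently in practice):
# open with swmr=True and call refresh() to pick up flushed data.
with h5py.File("swmr_demo.h5", "r", libver="latest", swmr=True) as f:
    dset = f["values"]
    dset.refresh()
    print(dset[...])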

Related

Build output of SpiderMonkey under Windows

I built SpiderMonkey 60 under Windows (VS2017) according to the documentation, using
../configure --enable-nspr-build followed by mozmake.
In the output folder (dist\bin) I could see 5 DLLs created:
mozglue.dll, mozjs-60.dll, nspr4.dll, plc4.dll, plds4.dll
In order to run the SpiderMonkey Hello World sample, I linked my C++ program with mozjs-60.lib and had to copy the following DLLs to my program's location: mozglue.dll, mozjs-60.dll, nspr4.dll
It seems that plc4.dll, plds4.dll are not needed for the program to run and execute scripts.
I could not find any documentation about the purpose of each one of the DLLs. Do I need all 5 DLLs? What is the purpose of each one?
Quoting from the archived NSPR release notes for an old version, I found this:
The plc (Portable Library C) library is a separate library from the
core nspr. You do not need to use plc if you just want to use the core
nspr functions. The plc library currently contains thread-safe string
functions and functions for processing command-line options.
The plds (Portable Library Data Structures) library supports data
structures such as arenas and hash tables. It is important to note
that services of plds are not thread-safe. To use these services in a
multi-threaded environment, clients have to implement their own
thread-safe access, by acquiring locks/monitors, for example.
It sounds like they are unused unless specifically loaded by your application.
It seems it would be safe to not distribute these if you don't need them.

Why does CGO_ENABLED make such an impact on virtual memory?

I have a small daemon written in Go, which works in a loop and does some stuff. I've discovered that the daemon behaves differently depending on whether it's compiled with CGO_ENABLED=1 or CGO_ENABLED=0. For example, with CGO_ENABLED=1 (which is the default), the program's VSZ bloats up to 1-2GB within a short period of time (within an hour). With CGO_ENABLED=0, VSZ stays the same over a long period of time (days). Look at the numbers below:
CGO_ENABLED=1 (daemon has worked 5 minutes)
$ grep -E 'VmSize|VmRSS' /proc/14916/status
VmSize: 1084052 kB
VmRSS: 12524 kB
CGO_ENABLED=0 (daemon has worked ~30 hours)
$ grep -E 'VmSize|VmRSS' /proc/15160/status
VmSize: 110232 kB
VmRSS: 9756 kB
The daemon does not use cgo-dependent packages or functions. Other Go programs show the same behaviour. I know the difference between VSZ and RSS, and I'm interested in the nature of this behaviour: why does a program compiled with CGO_ENABLED=1 ask the kernel for so much memory?
I would prefer answers that are not of the form "don't worry, VSZ is just virtual memory, and it's not really used by the process".
I could make an educated guess.
As you probably know, the compiler of the "reference" Go implementation (historically dubbed "gc"; the one available for download from the main site) by default produces statically-linked binaries. This means such binaries rely only on the so-called "system calls" provided by the OS kernel and do not depend on any shared libraries provided by the OS (or 3rd parties).
On Linux-based platforms, this is not completely true: in the default setting (building on Linux for Linux, i.e., not cross-compiling) the generated binary is actually linked with libc
and with libpthread (indirectly, via libc).
This "twist" comes out of the two needs the Go standard library has to interact with the OS:
DNS resolving, which is needed by the net package.
User and group lookup, which is needed by the os package.
The problem here is two-fold:
Linux itself (that is, the kernel, not the whole OS) does not provide any means to carry out those tasks.
Any typical UNIX-like system, since forever, provides for both those tasks using a special facility called "NSS", the "Name-Service Switch"¹. The NSS provides pluggable modules which can serve as the databases offering queries of a particular type: DNS, the user/group database, and more (such as well-known names for "services", etc). A rather common example of a non-standard provider for the user/group databases is a local service which contacts an LDAP server.
On a typical GNU/Linux-based OS, the NSS is implemented by libc (on less typical systems it might be provided by a separate shared library, but this does not change much).
Since libc is, again typically, a rather stable library in terms of its API (it even provides versioned symbols to be future-proof), the Go authors rightfully decided that linking against libc to import a minimal subset of symbols (mostly getaddrinfo, getnameinfo, getpwnam_r, etc.) is OK to do by default: it's safe for 99% of cases, and when it isn't, those who have to tackle those cases usually know what to do anyway.
So, by default cgo is enabled and used to implement these lookups via NSS. If cgo is disabled, the Go compiler instead links in its own fallback implementations, which try to mimic a subset of what a full-blown NSS implementation does (i.e. parse /etc/resolv.conf and use the information from it to query the DNS servers listed there directly; parse /etc/passwd and /etc/group to serve the user/group database queries).
As you can see, in the default case:
The libc gets mapped in, and
it is initialized and uses some memory for its own needs, such as caching the data the NSS calls return.
Conversely, when cgo is disabled, the above two things do not happen. You have more stdlib code linked in statically, but it looks like the default case merely trumps the latter one in terms of the overall cumulative RSS usage.
Consider studying the output of this query for additional fun ;-)
¹ not to be confused with Mozilla's libnss.

Why prefer distributing a shared library with executables instead of linking statically?

Scenario: two unrelated pieces of software are going to be distributed with their own copy of the same shared library. They will both be installed on the same machine (running Windows), and they're going to be run at the same time.
In this scenario, from my understanding, the two programs won't share the library in memory without somehow specifying it, which doesn't seem to be the norm (correct me if I'm wrong)... In other words, most or all of the programs that use this library will have their own copy of it, both in memory and on disk, which is roughly the same as what statically linked programs would have.
Is it preferable for the writers of each program to ship the shared library (together with their programs) rather than linking with the library statically, or is the difference negligible?

How to run an OpenMP program on clusters with multiple nodes? [duplicate]

I want to know if it would be possible to run an OpenMP program on multiple hosts. So far I have only heard of programs that can be executed on multiple threads, but all within the same physical computer. Is it possible to execute a program on two (or more) clients? I don't want to use MPI.
Yes, it is possible to run OpenMP programs on a distributed system, but I doubt it is within the reach of every user around. ScaleMP offers vSMP - an expensive commercial hypervisor software that allows one to create a virtual NUMA machine on top of many networked hosts, then run a regular OS (Linux or Windows) inside this VM. It requires a fast network interconnect (e.g. InfiniBand) and dedicated hosts (since it runs as a hypervisor beneath the normal OS). We have an operational vSMP cluster here and it runs unmodified OpenMP applications, but performance is strongly dependent on data hierarchy and access patterns.
NICTA used to develop a similar SSI hypervisor named vNUMA, but development also stopped. Besides, their solution was IA64-specific (IA64 is Intel Itanium, not to be confused with Intel 64, which is their current generation of x86 CPUs).
Intel used to develop Cluster OpenMP (ClOMP; not to be confused with the similarly named project to bring OpenMP support to Clang), but it was abandoned due to "general lack of interest among customers and fewer cases than expected where it showed a benefit" (from here). ClOMP was an Intel extension to OpenMP and it was built into the Intel compiler suite, e.g. you couldn't use it with GCC (this request to start ClOMP development for GCC went into limbo). If you have access to old versions of Intel compilers (versions 9.1 to 11.1), you would have to obtain a (trial) ClOMP license, which might be next to impossible given that the product is dead and old (trial) licenses have already expired. Then again, starting with version 12.0, Intel compilers no longer support ClOMP.
Other research projects exist (just search for "distributed shared memory"), but only vSMP (the ScaleMP solution) seems to be mature enough for production HPC environments (and it's priced accordingly). Seems like most efforts now go into development of co-array languages (Co-Array Fortran, Unified Parallel C, etc.) instead. I would suggest that you have a look at Berkeley UPC or invest some time in learning MPI as it is definitely not going away in the years to come.
Previously, there was Cluster OpenMP.
Cluster OpenMP was an implementation of OpenMP that could make use of multiple SMP machines without resorting to MPI. This advance had the advantage of eliminating the need to write explicit messaging code, as well as not mixing programming paradigms. The shared memory in Cluster OpenMP was maintained across all machines through a distributed shared-memory subsystem. Cluster OpenMP is based on the relaxed memory consistency of OpenMP, allowing shared variables to be made consistent only when absolutely necessary. source
Performance Considerations for Cluster OpenMP
Some memory operations are much more expensive than others. To achieve good performance with Cluster OpenMP, the number of accesses to unprotected pages must be as high as possible, relative to the number of accesses to protected pages. This means that once a page is brought up-to-date on a given node, a large number of accesses should be made to it before the next synchronization. In order to accomplish this, a program should have as little synchronization as possible, and re-use the data on a given page as much as possible. This translates to avoiding fine-grained synchronization, such as atomic constructs or locks, and having high data locality source.
Another option for running OpenMP programs on multiple hosts is the remote offloading plugin in the LLVM OpenMP runtime.
https://openmp.llvm.org/design/Runtimes.html#remote-offloading-plugin
The big issue with running OpenMP programs on distributed memory is data movement. Coincidentally, that is also one of the major issues in programming GPUs. Extending OpenMP to handle GPU programming has given rise to OpenMP directives to describe data transfer. Programming GPUs has also forced programmers to think more carefully about building programs that consider data movement.

Is it possible to write a libPOSIX for Windows (Win32) without requiring a background service or DLL that's always loaded?

I know about Cygwin, and I know of its shortcomings. I also know about the slowness of fork, but not why on Earth it's not possible to work around that. I also know Cygwin requires a DLL. I also understand POSIX defines a whole environment (shell, etc...), that's not really what I care about here.
My question is asking if there is another way to tackle the problem. I see more and more of POSIX functionality being implemented by the MinGW projects, but there's no complete solution providing a full-blown (comparable to Linux/Mac/BSD implementation status) POSIX functionality.
The question really boils down to:
Can the Win32 API (as of MSVC20??) be efficiently used to provide a complete POSIX layer over the Windows API?
Perhaps this will turn out to be a full libc that only taps into the OS library for low-level things like filesystem access, threads, and process control. But I don't know exactly what else POSIX consists of. I doubt a library can turn Win32 into a POSIX-compliant entity.
POSIX <> Win32.
If you're trying to write apps that target POSIX, why are you not using some variant of *N*X? If you prefer to run Windows, you can run Linux/BSD/whatever inside Hyper-V/VMWare/Parallels/VirtualBox on your PC/laptop/etc.
Windows used to have a POSIX compliant environment that ran alongside the Win32 subsystem, but was discontinued after NT4 due to lack of demand. Microsoft bought Interix and released Services For Unix (SFU). While it's still available for download, SFU 3.5 is now deprecated and no longer developed or supported.
As to why fork is so slow, you need to understand that fork isn't just "Create a new process", it's "create a new process (itself an expensive operation) which is a duplicate of the calling process along with all memory".
In *N*X, the forked process is mapped to the same memory pages as the parent (i.e. is pretty quick) and is only given new pages as and when the forked process tries to modify any shared pages. This is known as copy-on-write. This is largely achievable because in UNIX there is no hard barrier between the parent and forked processes.
In NT, on the other hand, all processes are separated by a barrier enforced by CPU hardware. In NT, the easiest way to spawn a parallel activity which has access to your process' memory and resources, is to create a thread. Threads run within the memory space of the creating process and have access to all of the process' memory and resources.
You can also share data between processes via various forms of IPC, RPC, Named Pipes, mailslots, memory-mapped files but each technique has its own complexities, performance characteristics, etc. Read this for more details.
Because it tries to mimic UNIX, Cygwin's 'fork' operation creates a new child process (in its own isolated memory space) and has to duplicate every page of memory in the parent process within the newly forked child. This can be a very costly operation.
Again, if you want to write POSIX code, do so in *N*X, not NT.
How about this:
Most of the Unix API is implemented by the POSIX.DLL dynamically loaded (shared) library. Programs linked with POSIX.DLL run under the Win32 subsystem instead of the POSIX subsystem, so programs can freely intermix Unix and Win32 library calls.
From http://en.wikipedia.org/wiki/UWIN
The UWIN environment may be what you're looking for. Note that it is hosted at research.att.com; while UWIN is distributed under a liberal license, it is not the GNU license. Also, as it is research for AT&T, and only secondarily something that they distribute for use, there are a lot of issues with documentation.
For more info, see my write-up as the last answer to Regarding 'for' loop in KornShell.
Hmm, the main UWIN link in that post is broken; try
http://www2.research.att.com/sw/download/
Also, you can look at
https://mailman.research.att.com/pipermail/uwin-users/
OR
https://mailman.research.att.com/pipermail/uwin-developers/
To get a sense of the features vs issues.
I hope this helps.
The question really boils down to: Can the Win32 API (as of MSVC20??)
be efficiently used to provide a complete POSIX layer over the Windows
API?
Short answer: No.
"Complete POSIX" means fork(), mmap(), signal() and such, and these are [almost] impossible to implement on NT.
To drive the point home: GNU Hurd has problems with fork() as well, because Hurd kernel is not POSIX.
NT is not POSIX too.
Another difference is persistence:
In POSIX-compliant systems it is possible to create system objects and leave them there. Examples of such objects are named pipes and shared memory objects (shms). You can create a named pipe or a shm, and leave it in the filesystem (or in a special filesystem-like place) where other processes will be able to access it. The downside is that a process might die and fail to clean up after itself, leaving unused objects behind (you know about zombie processes? same thing).
In NT every object is reference-counted, and is destroyed as soon as its last handle is closed. Files are among the few objects that persist.
Symlinks are a filesystem feature and don't exactly depend on the NT kernel, but the current implementation (in Vista and later) is incapable of creating object-type-agnostic symlinks. That is, a symlink is either a file or a directory, and must link to either a file or a directory. If the target has the wrong type, the symlink won't work. You can give it the right type if the target exists when you create the symlink, but POSIX requires that symlinks may be created without their target existing. I can't imagine a use-case for a symlink that points first to a file, then to a directory, but POSIX says that this should work, and if it doesn't, you're not completely POSIX-compliant. And if your symlinking API/utility must be given an option that specifies the right type when the target doesn't exist, that also breaks POSIX compatibility.
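As a small illustration of that type requirement, here is how it surfaces in Python's os.symlink on Windows; the paths below are hypothetical, and creating symlinks on Windows may require administrator rights or Developer Mode:

# On Windows a symlink is created as either a file link or a directory link;
# target_is_directory selects which, even if the target does not exist yet.
# Paths here are made up for illustration only.
import os

os.symlink(r"C:\data\report.txt", r"C:\links\report.txt",
           target_is_directory=False)
os.symlink(r"C:\data\archive", r"C:\links\archive",
           target_is_directory=True)

# On POSIX systems there is no such distinction: a symlink is just a name,
# and it may be created before its target exists (the flag is ignored there).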
It is possible to replicate some POSIX features to some degree (such as "integer descriptors in a single namespace, referencing any I/O object, and being select()able") without sacrificing [much] performance, but that is still a major undertaking, and the POSIX interface is really restrictive (that is, if you could just add one more argument to that function, it would have been possible to Do The Right Thing... but you can't, unless you want to throw POSIX compliance away).
Your best bet is not to rely on POSIX features that are difficult to port to non-POSIX systems, or to abstract in such a way that lower levels may have separate implementations for different OSes and upper levels do not care about the details.
