Is it possible to memcpy contiguous HDF5 datasets? - performance

I wish to aggregate several HDF5 files into one: at some point this operation requires copying all the "source" datasets side by side into the resulting one.
Is there a way to do this memcpy-style? The H5Dwrite and H5Dread functions would require a temporary buffer and thus an additional allocation / copy / free.
I am aware that the HDF5 specification allows datasets to be physically stored in chunks, but that is not the case here.
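For reference, a minimal sketch of the "memcpy-style" idea using h5py (an assumption; the question is about the C API): for a contiguous, uncompressed dataset, H5Dget_offset (exposed in h5py as DatasetID.get_offset()) reports the byte offset of the raw data within the file, so the bytes can be copied at the file level instead of going through H5Dread/H5Dwrite. This is an untested sketch; it assumes the destination dataset already exists with the same dtype and shape, and that its storage is already allocated (e.g. early allocation, or already written once):

import h5py

def raw_copy(src_path, src_name, dst_path, dst_name, chunk=64 * 1024 * 1024):
    # Query the raw-data offsets through HDF5, then copy bytes at file level.
    with h5py.File(src_path, "r") as fsrc, h5py.File(dst_path, "r") as fdst:
        src, dst = fsrc[src_name], fdst[dst_name]
        src_off = src.id.get_offset()        # wraps H5Dget_offset()
        dst_off = dst.id.get_offset()
        nbytes = src.id.get_storage_size()   # wraps H5Dget_storage_size()
    if src_off is None or dst_off is None:
        raise RuntimeError("dataset not contiguous or storage not allocated")
    # Plain file I/O outside the HDF5 library, in bounded pieces.
    with open(src_path, "rb") as fin, open(dst_path, "r+b") as fout:
        fin.seek(src_off)
        fout.seek(dst_off)
        remaining = nbytes
        while remaining:
            buf = fin.read(min(chunk, remaining))
            fout.write(buf)
            remaining -= len(buf)

(Within the C API, H5Ocopy can also copy an entire dataset between files without an application-level buffer, though it copies the dataset as a new object rather than packing several sources side by side into one dataset.)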

Related

Dataset does not fit in memory

I have an MNIST-like dataset that does not fit in memory (process memory, not GPU memory).
My dataset is 4GB.
This is not a TFLearn issue.
As far as I know model.fit requires an array for x and y.
TFLearn example:
model.fit(x, y, n_epoch=10, validation_set=(val_x, val_y))
I was wondering if there's a way we can pass a "batch iterator" instead of an array.
Basically for each batch I would load the necessary data from disk.
This way I would not run into process memory overflow errors.
EDIT
np.memmap could be an option. But I don't see how to skip the first few bytes that compose the header.
You can use the Dataset API.
"The Dataset API supports a variety of file formats so that you can process large datasets that do not fit in memory"
Basically the input pipeline would become part of your graph.
If memory is still an issue then you can use a generator to create your tf.data.Dataset. Further, you could potentially make the process quicker by preparing tfrecords to create your Dataset.
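A minimal sketch of that generator approach with tf.data (TF 2.x); the MNIST-style file names and the 16-/8-byte IDX headers skipped via np.memmap's offset argument are illustrative assumptions, while from_generator, TensorSpec and prefetch are the actual API:

import numpy as np
import tensorflow as tf

def batches_from_disk(batch_size=128):
    # np.memmap's offset argument skips a fixed-size file header
    # (16 bytes for IDX image files, 8 bytes for IDX label files).
    x = np.memmap("train-images-idx3-ubyte", dtype=np.uint8, mode="r",
                  offset=16).reshape(-1, 28, 28)
    y = np.memmap("train-labels-idx1-ubyte", dtype=np.uint8, mode="r", offset=8)
    for i in range(0, len(y), batch_size):
        # Only one batch at a time is materialized in process memory.
        yield x[i:i + batch_size].astype(np.float32) / 255.0, y[i:i + batch_size]

dataset = tf.data.Dataset.from_generator(
    batches_from_disk,
    output_signature=(
        tf.TensorSpec(shape=(None, 28, 28), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.uint8),
    ),
).prefetch(tf.data.AUTOTUNE)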

Which is faster, Loading assets from a big package or from many files spread across the filesystem?

Many applications, mostly video games (mainly ones with constant content streaming), take the approach of having a single big package file containing many of the game assets. That is [arguably] done for security and space-efficiency reasons, but perhaps there's a performance reason too?
Given an application that loads 3+ asset files every 5 seconds using asynchronous I/O in a secondary streaming thread, would it be technically faster to perform the I/O on a single big file, seeking and reading at the various asset offsets as necessary, or to read each asset from a separate file spread across the operating system's filesystem?
There would probably be differences across HDDs, SSDs and other factors; what are they?
Assume the files aren't fragmented (i.e. space for them was reserved during installation before the content was actually written), but describing the effect of fragmentation on the result would be interesting too.
The files-spread-across-the-filesystem approach seems interesting for quick production and modding support, but if there's a performance penalty, care should be taken.
This question is purely theoretical and out of curiosity at this point.
Creating a file handle/descriptor requires security validation and filesystem metadata operations, so it is arguable that many small-file operations will be slower than one operation on one big file. Whether there is a measurable difference in the specific context of your code remains to be demonstrated.
BTW, true asynchronous I/O should not require 'secondary' threads. You are likely describing synchronous I/O performed on a different thread, which is a completely different beast.
Achieving high-throughput I/O is very platform specific. To illustrate, for Windows specifics read Designing Applications for High Performance - Part 1, Part 2 and Part 3.
Opening a file requires string processing to validate the file name, looking up the directory that holds the file's block references, and then seeking to the first block of the file to start reading.
The directory lookup can be cached but all the rest has a cost per file.
Also having one big file lets the OS know that it will be accessed in one go and read the next blocks of the file while you are processing the data; however if you only read some assets then this read-ahead will possibly read unnecessary blocks.
A better solution would be a hybrid approach: collect assets that are often loaded together into buckets, and have a database that says, per level, which asset buckets need to be read (see the sketch after this answer).
You can also duplicate data across several buckets; in the extreme, each level has a single bucket with all the assets that level needs. This takes up more space but nets you the greatest speedup, as you only need to dump one file into memory.
This presentation talks about how you can create a good distribution of seeks (files, in this context) vs. number of assets within a storage budget.
When creating a game for PC you can let the user decide at install time which side of the coin they want: the quick loading of one file per level, or the space saving of each asset stored only once.
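For reference, a toy sketch of the packed-file idea: an index of offsets and sizes up front, then one seek and one read per asset against a single open file. The pack layout here is made up purely for illustration:

import json
import struct

def load_index(pack_file):
    # Hypothetical layout: 8-byte little-endian index length, then a JSON
    # index mapping asset name -> [offset, size], then the raw asset bytes.
    pack_file.seek(0)
    (index_len,) = struct.unpack("<Q", pack_file.read(8))
    return json.loads(pack_file.read(index_len))

def read_asset(pack_file, index, name):
    # One seek + one read, no extra open()/close() per asset.
    offset, size = index[name]
    pack_file.seek(offset)
    return pack_file.read(size)

# usage sketch
# with open("assets.pak", "rb") as pak:
#     index = load_index(pak)
#     mesh = read_asset(pak, index, "level1/mesh.bin")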

Clojure Time Series Analysis

I have a large data set (200GB uncompressed, 9GB compressed with bzip2 -9) of stock tick data.
I want to run some basic time series analysis on them.
My machine has 16GB of RAM.
I would prefer to:
keep all data, compressed, in memory
decompress that data on the fly, and stream it [so nothing ever hits disk]
do all analysis in memory
Now, I think there's nice interactions here with Clojure's laziness, and future objects (i.e. I can define objects s.t. when I try to access them, I'll decompress them on the fly.)
Question: what are the things I should keep in mind when doing high performance time series analysis in Clojure?
I'm particularly interested in tricks involving:
efficiently storing tick data in memory
efficiently doing computation
weird convolutions to reduce # of passes over the data
Books / articles / research paper suggestions welcome. (I'm a CS PhD student).
Thanks.
Some ideas:
In terms of storing the compressed data, I don't think you will be able to do much better than your OS's own filesystem caching. Just make sure it's configured to use 11GB+ of RAM for filesystem caching, and it should pull your whole compressed data set into memory as it is read the first time.
You should then be able to define your Clojure code to pull in the data lazily via a ZipInputStream, which will perform the decompression for you.
If you need to perform a second pass on the data, just create a new ZipInputStream on the same file. OS level caching should ensure that you don't hit the disk again.
I have heard of systems like that implemented in Java, so it is possible. You'll certainly want to understand how to create your own lazy sequences in order to accomplish this. I also wouldn't hesitate to drop down into Java if you need to make sure you're dealing with the primitive types you want; e.g. Clojure won't generate code to do math on 32-bit ints, only on longs, and if you don't want that it could be a pain.
It would also be worth some effort to make your in-memory format compatible with a disk format. That would give you the option of memory mapping files, or (at the very least) make your startup easy if your program were to crash. e.g. It could just read the files on disk to recover its previous state.

Fast, low-memory, constant key-value database supporting concurrent and random access reads

I need an on-disk key-value store, not too big or distributed. The use case is as follows:
The full DB will be a few GBs in size
Both key and value are of constant size
It's a constant database. Once the entire database is written I don't need to write any more entries (or will write very infrequently)
Keys will be accessed in unpredictable order
Supporting concurrent reads by multiple processes is a must.
It has to be very fast, because the readers will be accessing millions of keys in a tight loop. So it should be as close as possible to the performance of looping over an associative array (say, STL's std::map)
Ideally it should allow one to set how much RAM to use; typically it should use a few hundred MBs
Written in C or C++. An existing Python extension will be a big plus, but I can add that on my own
So cdb and gdbm look like good choices, but just wanted to know if there are more suitable choices. Pointers to relevant benchmarks or even relevant anecdotal evidence will be appreciated.
What database did you end up using?
If you like cdb and you need > 4 GB database, please have a look at mcdb, which is originally based on cdb, plus some performance enhancements and the addition of support for 4 GB+ constant databases.
https://github.com/gstrauss/mcdb/
Python, Perl, Lua, and Ruby extensions are provided. mcdb is written in C and uses mmap under the hood and so easily supports lock-free concurrent reads between threads and between processes. Since it is backed by a memory-mapped file, pages are mapped in from disk as needed and memory is effectively constant even as the number of processes accessing the database increases.
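To illustrate why an mmap-backed constant database can approach in-memory lookup speed (this is not mcdb's format or API, just a toy sketch): with fixed-size records sorted by key, a lookup is a binary search over file pages that the OS maps in on demand and shares between all reading processes:

import mmap
import os

KEY_SIZE, VAL_SIZE = 16, 32          # constant-size keys and values (assumed)
REC_SIZE = KEY_SIZE + VAL_SIZE

class ConstDB:
    # Toy read-only store: records sorted by key, binary-searched via mmap,
    # so concurrent readers share the same physical pages with no locking.
    def __init__(self, path):
        self._fd = os.open(path, os.O_RDONLY)
        self._mm = mmap.mmap(self._fd, 0, access=mmap.ACCESS_READ)
        self._n = len(self._mm) // REC_SIZE

    def get(self, key: bytes):
        lo, hi = 0, self._n - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            off = mid * REC_SIZE
            k = self._mm[off:off + KEY_SIZE]
            if k == key:
                return self._mm[off + KEY_SIZE:off + REC_SIZE]
            if k < key:
                lo = mid + 1
            else:
                hi = mid - 1
        return None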
Have you looked at bdb? It sounds like a good use case for BDB.
I like hamsterdb because I wrote it :)
http://www.hamsterdb.com
frequently used with database sizes of several GBs
keys/values can have any size you wish
random access + directional access (with cursors)
concurrent reads: hamsterdb is thread safe, but not yet concurrent. I'm working on this.
if your cache is big enough then access will be very fast (you can specify the cache size)
written in c++
python extension is available, but terribly outdated; will need fixes
if you want to evaluate hamsterdb and need some help then feel free to drop me a mail.

How is fseek() implemented in the filesystem?

This is not a pure programming question; however, it impacts the performance of programs using fseek(), hence it is important to know how it works. A little disclaimer so that it doesn't get closed.
I am wondering how efficient it is to insert data in the middle of a file. Suppose I have a file with 1MB of data and I then insert something at the 512KB offset. How efficient would that be compared to appending my data at the end of the file? Just to make the example complete, let's say I want to insert 16KB of data.
I understand the answer varies depending on the filesystem, however I assume that the techniques used in common filesystems are quite similar and I just want to get the right notion of it.
(disclaimer: I just want to add some hints to this interesting discussion)
IMHO there are some things to take into account:
1) fseek is not a primary system service, but a library function. To evaluate its performance we must consider how the file stream library is implemented. In general, the file I/O library adds a layer of buffering in user space, so the performance of fseek may be quite different depending on whether the target position is inside or outside the current buffer. Also, the system services that the I/O library uses may vary a lot; e.g. on some systems the library makes extensive use of file memory mapping when possible.
2) As you said, different filesystems may behave in very different ways. In particular, I would expect that a transactional filesystem must do something very smart, and perhaps expensive, to be prepared for a possible rollback of an aborted write operation in the middle of a file.
3) Modern OSes have very aggressive caching algorithms. An "fseeked" file is likely to be present in cache already, so operations become much faster. But they may degrade a lot if the overall filesystem activity produced by other processes becomes significant.
Any comments?
fseek(...) is a library call, not an OS system call. It is the run-time library that takes care of the actual overhead involved in making a system call to the OS; technically speaking, fseek indirectly makes a call to the system, but really it does not (this brings up a clear distinction between a library call and a system call). fseek(...) is a standard input/output function regardless of the underlying system... however... and this is a big however...
The OS will more than likely have cached the file in its kernel memory, that is, the direct offset to the location on the disk where the 1's and 0's are stored. A top-most layer within the kernel would hold the snapshot of what the file is composed of, i.e. the data irrespective of what it contains (it does not care either way, as long as the 'pointers' to the disk structure for that offset to the location on the disk are valid!)...
When fseek(...) occurs there is, indirectly, a lot of overhead: the kernel delegates the task of reading from the disk, and depending on how fragmented the file is, the data could in theory be "all over the place". That can be significant overhead, because from a user-land perspective (i.e. the C code doing an fseek(...)) the kernel has to scatter across the disk to gather the data into "one contiguous view of the data". Hence inserting into the middle of a file (remember, at this stage the kernel would have to adjust the locations/offsets into the actual disk platter for the data) would be deemed slower than appending to the end of the file.
The reason is quite simple: the kernel "knows" what the last offset was and can simply wipe the EOF marker and insert more data; behind the scenes, the kernel only has to allocate another block of memory for the disk buffer, with the adjusted offset to the location on disk following the EOF marker, once the appending of data is completed.
Let us take the ext2 FS and the Linux OS as an example. I don't think there will be a significant performance difference between an insert and an append. In both cases the file's inode and offset table must be read, the relevant disk sector mapped into memory, the data updated and, at some later point, the data written back to disk. What will make a big performance difference in this example is good temporal and spatial locality when accessing parts of the file, since this will reduce the number of load/store combos.
As a previous answer says, you may be able to speed up both operations if you deal with data writes that are exact multiples of the FS block size; in this case you could skip the load stage and just insert the new blocks into the file's inode data structure. This would not be practical, as you would need low-level access to the FS driver, and using it would be very restrictive and not portable.
One observation I have made about fseek on Solaris is that each call to it resets the read buffer of the FILE. The next read will then always read a full block (8K by default). So if you have a lot of random access with small reads, it's a good idea to do it unbuffered (setvbuf with a NULL buffer) or even use direct syscalls (lseek+read, or even better pread, which is only 1 syscall instead of 2). I suppose this behaviour will be similar on other OSes.
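For illustration, the positional-read idea from the above in Python (os.pread is POSIX-only; it issues a single syscall per read and bypasses any stdio-style buffering):

import os

fd = os.open("data.bin", os.O_RDONLY)

def read_at(offset, size):
    # One positional read: no separate seek call, and no full buffered block
    # pulled in by a stdio layer (the kernel may still do its own readahead).
    return os.pread(fd, size, offset)

# e.g. lots of small random reads in a tight loop
records = [read_at(off, 64) for off in (4096, 123456, 987654)]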
You can insert data into the middle of a file efficiently only if the data size is a multiple of the FS sector, but OSes don't provide such functions, so you would have to use a low-level interface to the FS driver.
Inserting data in the middle of a file is less efficient than appending to the end because, when inserting, you have to move the data after the insertion point to make room for the data being inserted. Moving that data involves reading it from disk, writing the data to be inserted, and then writing the old data back after the inserted data. So you have at least one extra read and write when inserting.
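To make that concrete, a small sketch of what user code has to do in each case (filesystems offer no "shift the rest of the file" primitive):

def insert_bytes(path, offset, data):
    # Insert by rewriting everything after the insertion point.
    with open(path, "r+b") as f:
        f.seek(offset)
        tail = f.read()          # read the old data past the insertion point
        f.seek(offset)
        f.write(data)            # write the new data...
        f.write(tail)            # ...then rewrite the shifted old data

def append_bytes(path, data):
    # Append: one write at the end, nothing has to move.
    with open(path, "ab") as f:
        f.write(data)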

Resources