Dataset does not fit in memory - memory-management

I have an MNIST-like dataset that does not fit in memory (process memory, not GPU memory).
My dataset is 4GB.
This is not a TFLearn issue.
As far as I know, model.fit requires arrays for x and y.
TFLearn example:
model.fit(x, y, n_epoch=10, validation_set=(val_x, val_y))
I was wondering if there's a way to pass a "batch iterator" instead of an array.
Basically for each batch I would load the necessary data from disk.
This way I would not run into process memory overflow errors.
EDIT
np.memmap could be an option, but I don't see how to skip the first few bytes that make up the header.
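For what it's worth, np.memmap does take an offset argument that skips a fixed-size header. A minimal sketch, assuming an IDX-style image file with a 16-byte header (the file name, dtype, offset, and shape are assumptions to adjust for the real format):

import numpy as np

# Map the raw image bytes without loading them into process memory.
# offset=16 skips an IDX-style header (magic number, count, rows, cols);
# adjust dtype, offset, and shape to match the actual file format.
images = np.memmap("train-images-idx3-ubyte", dtype=np.uint8,
                   mode="r", offset=16, shape=(60000, 28, 28))

# Slices are read from disk on demand, so pulling one batch at a time stays cheap.
batch = np.asarray(images[0:128], dtype=np.float32) / 255.0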

You can use the Dataset API.
"The Dataset API supports a variety of file formats so that you can process large datasets that do not fit in memory"
Basically the input pipeline would become part of your graph.
If memory is still an issue, then you can use a generator to create your tf.data.Dataset. Further, you could potentially make the process quicker by preparing TFRecords to create your Dataset.
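A minimal sketch of the generator route (the shard file names, feature width, and label encoding are placeholders, not part of the original question):

import numpy as np
import tensorflow as tf

def example_generator():
    # Hypothetical loader: stream one example at a time from disk so only
    # a memory-mapped shard slice is ever resident in process memory.
    for path in ["shard_0.npy", "shard_1.npy"]:          # placeholder file names
        shard = np.load(path, mmap_mode="r")
        for row in shard:
            yield row[:-1].astype(np.float32), np.int64(row[-1])

dataset = (tf.data.Dataset.from_generator(
               example_generator,
               output_types=(tf.float32, tf.int64),
               output_shapes=((784,), ()))
           .shuffle(10000)
           .batch(128)
           .prefetch(1))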

Related

TensorFlow Dataset performance?

I am implementing a model inspired by the NMT model. I am using a training set stored as TFRecords files, using a TFRecordDataset to fetch it and feed the model. Following Google's recommendations about input pipeline performance improvements, I have:
preprocessed as much as possible beforehand on CPU
stacked several training examples into TFRecords files of about 100 MB (fewer files, each containing more examples)
used num_parallel_calls and prefetch on the Dataset map operations.
However, the GPU stays at 40% utilization at most, and training is nearly as slow as when run on the CPU. I am thus wondering about the prefetch operation.
If I understand correctly, it will create a special thread that buffers N examples. But what does that mean? What happens to the other examples that are not buffered?
Is there an optimal relation between the prefetch buffer size, the number of examples in the complete Dataset, and the batch size? In the NMT code, the prefetch buffer size is set to 1000*batch_size, but why? If, for example, I am using 10000 examples and a batch size of 100, what should the prefetch buffer size be?
Any other advice regarding Dataset speedup would be appreciated.
Apparently, the Dataset API runs on the CPU and not on the GPU, so that answers the question.
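For reference, a hedged sketch of the kind of pipeline being discussed, written against the TF 1.x API; the shard name and feature spec are placeholders, not taken from the question:

import tensorflow as tf

def parse_example(serialized):
    # Placeholder parse function; the real feature spec depends on how
    # the TFRecords were written.
    features = tf.parse_single_example(
        serialized,
        {"source": tf.VarLenFeature(tf.int64),
         "target": tf.VarLenFeature(tf.int64)})
    return features["source"], features["target"]

dataset = (tf.data.TFRecordDataset(["train-00000.tfrecord"])   # placeholder shard name
           .map(parse_example, num_parallel_calls=4)
           .shuffle(buffer_size=10000)
           .batch(100)
           # after batch(), prefetch counts whole batches: keep about one batch
           # ready on the CPU side while the GPU works on the current one
           .prefetch(buffer_size=1))

Prefetching doesn't move the pipeline onto the GPU; it just overlaps the CPU-side input work with the GPU compute, which is usually what recovers the lost utilization.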

Large 3D volume bad_alloc

I'm developing an application that builds a 3D Voronoi diagram from a 3D point cloud, using a dynamically allocated boost multi_array to store the whole diagram.
One of the test cases I'm using requires a large amount of memory (an array of around [600][600][600]), which is over the allowed limit and results in bad_alloc.
I already tried splitting the diagram into smaller pieces, but that doesn't work either, as the total memory required still seems to be over the limit.
My question is: how can I work with such a large 3D volume within the constraints of my PC?
EDIT
The Element type is a struct as follows:
struct Elem {
    int R[3];
    int d;
    int label;
};
The elements are indexed in the multi_array based on their position in 3D space.
The multi_array is constructed by setting specific points in the space, read from a file, and then filling the intermediate cells by passing a forward and a backward mask over the whole space.
You didn't say how you get all your points. If you read them from a file, then don't read them all. If you compute them, then you can probably recompute them as needed. In both cases you can implement some cache that stores the most often used ones. If you know how your algorithm will use the data, then you can predict which values will be needed next. You can even do this in a different thread.
The second solution is to work on your data so that it fits in your RAM. You have 216 million points, but we don't know the size of a point. They are 3D, but do they use floats or doubles? Are they classes or simple structs? Do they have vtables? Do you use a Debug build? (In Debug builds, objects may be bigger.) Do you allocate the entire array at the beginning or incrementally? I believe there should be no problem storing 216M 3D points on a current PC, but it depends on the answers to all those questions.
The third way that comes to my mind is to use memory-mapped files, but I have never used them personally.
Here are few things to try:
Try to allocate in different batches, like 1 * 216M, 1k * 216k, or 1M * 216, to see how much memory you can get.
Try changing the boost multi_array to std::vector, or even a raw void*, and compare the maximum amount of RAM you can get.
You didn't mention the element type. Given an element that is a four-byte float, a 600*600*600 matrix takes only about 820 MiB, which is actually not very big. I'd suggest you check your operating system's limit on memory usage per process. On Linux, check it with ulimit -a.
If you really cannot allocate the matrix in memory, create a file of the desired size on disk, map it to memory using mmap, and then pass the memory address returned by mmap to boost::multi_array_ref.
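For illustration only (sketched in Python rather than C++, purely for brevity), the same disk-backed-array idea with the element struct from the question; the file name is made up:

import numpy as np

# Disk-backed 600x600x600 grid of Elem structs (~4.3 GB on disk); only the
# pages that are actually touched get faulted into memory. This mirrors the
# mmap + boost::multi_array_ref approach described above.
elem = np.dtype([("R", np.int32, (3,)), ("d", np.int32), ("label", np.int32)])
grid = np.memmap("voronoi.dat", dtype=elem, mode="w+", shape=(600, 600, 600))

grid["label"][10, 20, 30] = 7   # touches only the pages backing this element
grid.flush()                    # push dirty pages back to the file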

Clojure Time Series Analysis

I have a large data set (200 GB uncompressed, 9 GB compressed with bz2 -9) of stock tick data.
I want to run some basic time series analysis on them.
My machine has 16GB of RAM.
I would prefer to:
keep all data, compressed, in memory
decompress that data on the fly, and stream it [so nothing ever hits disk]
do all analysis in memory
Now, I think there are nice interactions here with Clojure's laziness and future objects (i.e. I can define objects such that when I try to access them, they are decompressed on the fly).
Question: what are the things I should keep in mind when doing high performance time series analysis in Clojure?
I'm particularly interested in tricks involving:
efficiently storing tick data in memory
efficiently doing computation
weird convolutions to reduce # of passes over the data
Books / articles / research paper suggestions welcome. (I'm a CS PhD student).
Thanks.
Some ideas:
In terms of storing the compressed data, I don't think you will be able to do much better than your OS's own file system caching. Just make sure it is configured to use 11 GB+ of RAM for file system caching, and it should pull your whole compressed data set into memory as it is read the first time.
You should then be able to define your Clojure code to pull in the data lazily via a ZipInputStream, which will perform the decompression for you.
If you need to perform a second pass on the data, just create a new ZipInputStream on the same file. OS level caching should ensure that you don't hit the disk again.
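As a language-neutral illustration of that streaming idea (sketched in Python here, since the decompress-on-the-fly pattern is the same; the file name and record layout are assumptions):

import bz2

def ticks(path):
    # Stream-decompress on the fly: only a small read buffer is in memory,
    # and a second pass simply reopens the file, hitting the OS page cache.
    with bz2.open(path, "rt") as f:
        for line in f:
            yield line.rstrip("\n").split(",")   # assumed CSV-like tick records

count, total = 0, 0.0
for row in ticks("ticks.csv.bz2"):               # assumed file name
    count += 1
    total += float(row[3])                       # assumed price column
print(total / count)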
I have heard of systems like that implemented in Java. It is possible. You'll certainly want to understand how to create your own lazy sequences in order to accomplish this. I also wouldn't hesitate to drop down into Java if you need to make sure that you're dealing with the primitive types that you want to deal with. E.g. Clojure won't generate code to do math on 32-bit ints; it will only generate code to work with longs, and if you don't want that it can be a pain.
It would also be worth some effort to make your in-memory format compatible with a disk format. That would give you the option of memory mapping files, or (at the very least) make your startup easy if your program were to crash. e.g. It could just read the files on disk to recover its previous state.

Reading in a very large file

I'm working on an application that reads in a huge text file (it can be up to 5 GB in size). Currently, I am using fscanf to read in the file, because I have found it to be the fastest so far. However, it still takes quite a lot of time to read the whole file in.
Is there a faster way to read in data from a file?
First, you should strongly avoid reading a 5GB file into memory as a single step. The memory impact alone should keep you away from this approach. Instead, you should try to take another approach such as:
Process the data as you read it and throw it away (see the sketch after this list)
Convert the file to a Core Data model prior to work
Convert the file to a fixed-length record format so you can do random-access
Modify the file format so that it is less redundant
Index the file so you can do random-access
Partition the data into separate files
Memory map the file using NSFileWrapper (far from a panacea, but can be useful in conjunction with the above; NSFileWrapper automatically does memory mapping)
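For illustration (sketched in Python rather than C, just to keep it short), here is the first option above, reading fixed-size chunks and discarding them as they are processed:

CHUNK = 1 << 20   # 1 MiB read size; tune to taste

def process(chunk):
    # Placeholder for whatever per-chunk work the application actually does;
    # here we just count newlines.
    return chunk.count(b"\n")

lines = 0
with open("thebigfile.dat", "rb") as f:   # file name borrowed from the baseline below
    while True:
        chunk = f.read(CHUNK)
        if not chunk:
            break
        lines += process(chunk)
print(lines)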
You should start by getting a performance baseline:
time cat thebigfile.dat > /dev/null
It is hard to imagine reading the file much faster than that, so that's your floor.
You should definitely do some performance analysis in Instruments and make sure the problem is the reading and not the processing. In particular, memory allocation can be more expensive than you may expect, particularly in a multi-threaded app.
Once you've investigated the above, and you still need really fast management of on-disk data, look at dispatch_io and dispatch_data. This is a really awesome tool for high-speed data management. But it is almost always better to improve your basic algorithms first before worrying about this kind of optimization.

Optimizing locations of on-disk data for sequential access

I need to store large amounts of data on disk in blocks of approximately 1 kB. I will be accessing these objects in a way that is hard to predict, but where patterns probably exist.
Is there an algorithm or heuristic I can use that will rearrange the objects on disk based on my access patterns to try to maximize sequential access, and thus minimize disk seek time?
On modern OSes (Windows, Linux, etc) there is absolutely nothing you can do to optimise seek times! Here's why:
You are in a pre-emptive multitasking system. Your application and all its data can be flushed to disk at any time - the user switches task, the screen saver kicks in, the battery runs out of charge, etc.
You cannot guarantee that the file is contiguous on disk. Doing Aaron's first bullet point will not ensure an unfragmented file. When you start writing the file, the OS doesn't know how big the file is going to be so it could put it in a small space, fragmenting it as you write more data to it.
Memory mapping the file only works as long as the file size is less than the available address range in your application. On Win32, the amount of address space available is about 2 GB minus the memory used by the application. Mapping larger files usually involves un-mapping and re-mapping portions of the file, which won't be the best of things to do.
Putting data in the centre of the file is no help as, for all you know, the central portion of the file could be the most fragmented bit.
To paraphrase Raymond Chen, if you have to ask about OS limits, you're probably doing something wrong. Treat your filesystem as an immutable black box, it just is what it is (I know, you can use RAID and so on to help).
The first step you must take (and must be taken whenever you're optimising) is to measure what you've currently got. Never assume anything. Verify everything with hard data.
From your post, it sounds like you haven't actually written any code yet, or, if you have, there is no performance problem at the moment.
The only real solution is to look at the bigger picture and develop methods to get data off the disk without stalling the application. This would usually be through asynchronous access and speculative loading. If your application is always accessing the disk and doing work with small subsets of the data, you may want to consider reorganising the data to put all the useful stuff in one place and the other data elsewhere. Without knowing the full problem domain, it's not possible to be really helpful.
Depending on what you mean by "hard to predict", I can think of a few options:
If you always seek based on the same block field/property, store the records on disk sorted by that field. This lets you use binary search for O(log n) efficiency (see the sketch after this list).
If you seek on different block fields, consider storing an external index for each field. A b-tree gives you O(log n) efficiency. When you seek, grab the appropriate index, search it for your block's data file address and jump to it.
Better yet, if your blocks are homogeneous, consider breaking them down into database records. A database gives you optimized storage, indexing, and the ability to perform advanced queries for free.
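A sketch of the first option, assuming fixed-size records sorted on disk by an 8-byte key (the exact record layout is made up for illustration):

import struct

REC = struct.Struct("<q1016s")   # assumed layout: 8-byte sorted key + 1016-byte payload

def find(f, key, n_records):
    # Classic binary search over fixed-size records sorted by key on disk:
    # O(log n) seeks instead of a full scan. f is the data file opened in "rb" mode.
    lo, hi = 0, n_records - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        f.seek(mid * REC.size)
        k, payload = REC.unpack(f.read(REC.size))
        if k == key:
            return payload
        if k < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None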
Use memory-mapped file access rather than the usual open-seek-read/write pattern. This technique works on Windows and Unix platforms.
In this way the operating system's virtual memory system will handle the caching for you. Accesses of blocks that are already in memory will result in no disk seek or read time. Writes from memory back to disk are handled automatically and efficiently and without blocking your application.
Aaron's notes are good too as they will affect initial-load time for a chunk that's not in memory. Combine that with the memory-mapped technique -- after all it's easier to reorder chunks using memcpy() than by reading/writing from disk and attempting swapouts etc.
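A minimal sketch of that memory-mapped pattern, using Python's mmap module for brevity (the file name and offsets are placeholders; the file is assumed to already exist and be large enough):

import mmap

with open("store.dat", "r+b") as f:            # placeholder file name
    view = mmap.mmap(f.fileno(), 0)            # map the whole file
    block = view[4096:5120]                    # read one 1 kB block; no explicit seek/read
    view[4096:5120] = block                    # writes go back through the page cache
    view.flush()                               # optional: force dirty pages to disk
    view.close()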
The simplest way to solve this is to use an OS which solves it for you under the hood, like Linux. Give it enough RAM to hold 10% of the objects and it will try to keep as many of them in the cache as possible, reducing the load time to 0. The recent server versions of Windows might work too (some of them didn't for me, which is why I mention this).
If this is a no go, try this algorithm:
Create a very big file on the hard disk. It is very important that you write this in one go so the OS will allocate a contiguous space on disk.
Write all your objects into that file. Make sure that each object is the same size (or give each one the same space in the file and note the length in the first few bytes of each chunk). Use an empty hard disk or a disk which has just been defragmented.
In a data structure, keep the offsets of each data chunk and how often it is accessed. When it is accessed very often, swap its position in the file with a chunk that is closer to the start of the file and which has a lesser access count.
[EDIT] Access this file with the memory-mapped API of your OS to allow the OS to effectively cache the most used parts to get best performance until you can optimize the file layout next time.
Over time, heavily accessed chunks will bubble to the top. Note that you can collect the access patterns over some time, analyze them, and do the reordering overnight when there is little load on your machine. Or you can do the reordering on a completely different machine and swap in the file (and the offset table) when that's done.
That said, you should really rely on a modern OS where a lot of clever people have thought long and hard to solve these issues for you.
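A rough Python sketch of that bookkeeping (the chunk size comes from the question; the file name, offset table, and swap policy are all assumptions):

CHUNK = 1024                      # ~1 kB blocks, as in the question
offsets = {}                      # object id -> byte offset in the big file
hits = {}                         # object id -> access count

def read_chunk(f, obj_id):
    # Every read bumps the access count so hot chunks can be identified later.
    # f is the big file opened in "r+b" mode.
    hits[obj_id] = hits.get(obj_id, 0) + 1
    f.seek(offsets[obj_id])
    return f.read(CHUNK)

def swap_toward_front(f, hot_id, cold_id):
    # During a quiet period, swap a frequently used chunk with a rarely used
    # one that sits closer to the start of the file, then fix the offset table.
    a, b = offsets[hot_id], offsets[cold_id]
    f.seek(a); chunk_a = f.read(CHUNK)
    f.seek(b); chunk_b = f.read(CHUNK)
    f.seek(a); f.write(chunk_b)
    f.seek(b); f.write(chunk_a)
    offsets[hot_id], offsets[cold_id] = b, a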
That's an interesting challenge. Unfortunately, I don't know how to solve this out of the box, either. Corbin's approach sounds reasonable to me.
Here's a little optimization suggestion, at least: place the most-accessed items at the center of your disk (or unfragmented file), not at the start or end. That way, seeks to lesser-used data will be shorter on average. Err, that's pretty obvious, though.
Please let us know if you figure out a solution yourself.
