TensorFlow Dataset performance? - performance

I am implementing a model inspired by the NMT model. My training set is stored as TFRecords files, and I use a TFRecordDataset to fetch it and feed the model. Following Google's recommendations for improving input pipeline performance, I have:
preprocessed as much as possible beforehand on CPU
packed training examples into TFRecords files of about 100 MB each (fewer files, each containing more examples)
used num_parallel_calls and prefetch on the Dataset map operations.
However, GPU utilization stays at 40% at most, and training is nearly as slow as when run on CPU. I am thus wondering about the prefetch operation.
If I understand correctly, it will create a special thread that buffers N examples. But what does that mean? What happens to the other examples that are not buffered?
Is there an optimal relation between the prefetch buffer size, the number of examples in the complete Dataset and the batch size? In the NMT code, the prefetch buffer size is set to 1000*batch_size, but why? If, e.g., I am using 10000 examples and a batch size of 100, what should the prefetch buffer size be?
Any other advice regarding Dataset speedup would be appreciated.

Apparently, Dataset API runs on CPU and not on GPU, so this answers the question.
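To make the prefetch question more concrete, here is a minimal sketch of what a bounded prefetch buffer does. It is plain Python, not TensorFlow's implementation; the names prefetch and batches are hypothetical, and the only thing assumed about tf.data is the documented meaning of buffer_size (the maximum number of elements buffered ahead of the consumer). In this sketch, examples that are not in the buffer have simply not been produced yet: the producer thread blocks when the buffer is full and resumes as soon as the consumer takes an element out.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(source, buffer_size):
    """Yield items from `source`, keeping up to `buffer_size` ready in advance."""
    buf = queue.Queue(maxsize=buffer_size)

    def producer():
        for item in source:
            buf.put(item)  # blocks while the buffer is full
        buf.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item


def batches(n_examples, batch_size):
    """Hypothetical stand-in for a batched dataset: yields lists of example ids."""
    for start in range(0, n_examples, batch_size):
        yield list(range(start, min(start + batch_size, n_examples)))


# 10000 examples, batch size 100, at most 2 batches buffered ahead.
consumed = list(prefetch(batches(10000, 100), buffer_size=2))
```

Every batch still reaches the consumer whatever the buffer size is; the buffer only decides how far ahead the producer is allowed to run. Whether a larger buffer helps depends on how uneven the producer's latency is, so it is worth measuring rather than copying a constant.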

Related

Dataset does not fit in memory

I have an MNIST-like dataset that does not fit in memory (process memory, not GPU memory).
My dataset is 4GB.
This is not a TFLearn issue.
As far as I know model.fit requires an array for x and y.
TFLearn example:
model.fit(x, y, n_epoch=10, validation_set=(val_x, val_y))
I was wondering if there's a way to pass a "batch iterator" instead of an array.
Basically for each batch I would load the necessary data from disk.
This way I would not run into process memory overflow errors.
EDIT
np.memmap could be an option. But I don't see how to skip the first few bytes that compose the header.
You can use the Dataset API.
"The Dataset API supports a variety of file formats so that you can process large datasets that do not fit in memory"
Basically the input pipeline would become part of your graph.
If memory is still an issue, you can use a generator to create your tf.data.Dataset. You could also potentially speed things up by preparing TFRecords files to create your Dataset from.
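As a rough illustration of the "batch iterator" idea, the sketch below reads one batch at a time from a binary file and skips a fixed-size header with seek. The function name iter_batches, the 16-byte header and the 4-byte records are assumptions made for the demo, not the layout of the dataset in the question. For the np.memmap route, note that np.memmap accepts an offset argument (in bytes) that serves the same purpose, and a generator like this one can be wrapped with tf.data.Dataset.from_generator.

```python
import os
import tempfile


def iter_batches(path, header_bytes, record_bytes, batch_size):
    """Yield lists of raw records, reading one batch at a time from disk."""
    with open(path, "rb") as f:
        f.seek(header_bytes)  # skip the file header
        while True:
            chunk = f.read(record_bytes * batch_size)
            if not chunk:
                return
            yield [chunk[i:i + record_bytes]
                   for i in range(0, len(chunk), record_bytes)]


# Tiny demo file: a 16-byte header followed by ten 4-byte records.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(demo_path, "wb") as f:
    f.write(b"H" * 16)
    for i in range(10):
        f.write(bytes([i]) * 4)

batch_sizes = [len(b) for b in iter_batches(demo_path, 16, 4, batch_size=4)]
first_record = next(iter_batches(demo_path, 16, 4, batch_size=4))[0]
```

Only one batch is held in memory at a time, so the process footprint stays independent of the file size.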

Maximize Tensorflow Performance

I'm using TensorFlow 1.2 for image segmentation on an AWS p2 instance (Tesla K80). Is there an easy way for me to find out if I can improve the performance of my code?
Here is what I know:
I measured the execution time of the various parts of my program and
99% of the time is spent calling session run.
sess.run([train_op, loss, labels_modified, output_modified],
feed_dict=feed_dict)
where feed_dict is a mapping from placeholders to tensors.
The session.run method only takes 0.43 seconds to execute for the following parameters: batch_size=1, image_height=512, image_width=512, channels=3.
The network has 14 convolutional layers (no dense layers) with a total of 11 million trainable parameters.
Because I'm doing segmentation I use a batch size of 1 and then compute the pixel-wise loss (512*512 cross entropy losses).
I tried to compile Tensorflow from source and got zero performance improvements.
I read through the performance guide https://www.tensorflow.org/performance/performance_guide but I don't want to spend a lot of time trying all of these suggestions. It already took me 8 hours to compile Tensorflow and it gave me zero benefits!
How can I find out which parts of the session run take most of the time? I have a feeling that it might be the loss calculation.
And is there any clear study that shows how much speedup I can expect from the things mentioned in the performance guide?
You're performing a computationally intensive task that requires a lot of calculations and a lot of memory. Your model has a lot of parameters, and each one has to be used in the forward pass, have its gradient computed in the backward pass, and be updated.
The suggestions on the page you linked are OK, and if you have followed them all there is nothing else you can do except create one or more additional instances and run the training in parallel. This will give you an Nx speedup (where N is the number of instances that compute the gradients for your input batch), but it is extremely expensive and not always applicable. It also requires changing your code to follow the client-server architecture for the gradient computation and weight updates.
Based on your small piece of code, I see you're using a feed dictionary. Generally it's best to avoid using feed dictionaries if queues can be used (see https://github.com/tensorflow/tensorflow/issues/2919). The TensorFlow documentation covers the use of queues. Switching to queues will definitely improve your performance.
Maybe you can run your code with tfprof to do some profiling to find out where the bottleneck is.
As a guess, the performance problem may be caused by feeding data. I don't know how you prepare your feed_dict, but if you have to read your data from disk for every sess.run, the program slows down because reading data and training run synchronously. You can try converting your data to TFRecords and making data loading and training asynchronous by using tf.FIFOQueue.

Best settings for bulk load in graphdb

I have been going through documentation but I am unable to identify what the general guidelines are for bulk loading.
As far as I can see the best way to bulk load data into graphdb is by using the LoadRDF tool.
However the general rules for the appropriate settings are not familiar to me.
First of all if you have an "average" server with an SSD drive what kind of parsing speed is acceptable? 1.000 statements / sec, 10.000 statements / sec or is it much more or less?
Also what are good settings? For example you can set the -Dpool.buffer.size which has a default of 200.000 statements but if you have 10gig of ram what would be the rule of thumb to increase this and if you have 100 or 300 gig of ram?
Another option is the -Dinfer.pool.size which is set to the maximum of threads as there are cpus with a minimum of 4. Thus 1 core = 4 threads and 32 cores is 32 threads. I think this does not require any extra tuning or is this only there if you want to reduce the CPU load and not overshoot to lets say 64 threads if you have 32 cores?
There are also extra options available through the turtle file with examples in configs/templates where perhaps owlim:cache-memory and owlim:tuple-index-memory could be useful during loading and the other settings more useful for after loading?
In the end does it also matter if you have 100's of individual files instead of one big turtle file and / or does compressing the files increase loading speed or does it only reduce the initial disk usage?
For me personally, I currently have a setup with 290 GB of RAM, 32 cores and 1.8 TB of RAID 0 SSD drives (which will be backed up after loading). I am trying to do an initial load of 3 billion triples, from SSD to the same SSD. With a global speed of 16.461 statements per second this will take a while, but I am not sure if and how to improve it.
The best place to get a reference to the standard data loading speed is the GraphDB benchmark page.
From a computational point of view, the data loading process consists of generating unique internal IDs for all RDF resources and indexing all statements in multiple sorted collections like PSOC, POSC and CPSO (if context indexes are enabled). This process is mainly affected by:
Reasoning complexity - the database integrates a forward-chaining inference engine. This means that for every newly added statement a predefined set of rules is fired recursively. Depending on the particular dataset and the configured rules, the number of materialised implicit statements may increase dramatically. The data loading speed depends on the number of indexed statements, not on the number of explicit input triples.
Size of the dataset - as the number of indexed statements in each collection increases, the time to add more data also increases. The two main factors are the logarithmic complexity of the sorted collections and the number of page splits caused by IDs arriving in random order in at least one of the collections.
The number of CPU cores will speed up the data loading only if there is inference. The import of every new file will have a minimal overhead, so this should not be a concern unless their size is considerable. For the heap size, we have found that a combination of SSD and a heap size limited to 30GB works best. If you restrict the heap size to 30GB, then you can benefit from -XX:+UseCompressedOops and still have a reasonable GC time.
Please note that GraphDB 8.x will also reserve off heap space for immutable data structures like the mapping of RDF resources to internal IDs! For a 3B dataset, it may become as big as 15GB. The main reason behind this design decision is to save GC time.
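For a sense of scale, the figures from the question can be turned into a rough time estimate. This is only back-of-the-envelope arithmetic: it reads "16.461 statements per second" as 16,461 (the question writes thousands with a dot), and it assumes the speed stays constant, which the points above suggest it will not, so the result is best treated as a lower bound. The function name is hypothetical.

```python
def estimated_load_hours(total_statements, statements_per_second):
    """Hours needed to load `total_statements` at a constant rate."""
    return total_statements / statements_per_second / 3600


# 3 billion triples at 16,461 statements per second (figures from the question).
hours = estimated_load_hours(3_000_000_000, 16_461)
print(f"{hours:.1f} hours")  # prints "50.6 hours"
```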

How is circular buffer used for spilling process in hadoop?

From "Hadoop the definitive guide"
[Each map task has a circular memory buffer that it writes the output to. The buffer is
100 MB by default, a size that can be tuned by changing the io.sort.mb property. When
the contents of the buffer reaches a certain threshold size (io.sort.spill.percent,
which has the default 0.80, or 80%), a background thread will start to spill the contents
to disk]
The question here is that, since each map task works on a single input split (which would be more or less equal to the size of an HDFS block, i.e. 64 MB), the condition for spilling to disk should never arise. Am I missing something? Please help.
Why do you assume the split size or the block size would be 64 MB? In practice, I have seen a small block size reduce performance (for the scale of data I analyse). I have seen better performance with a block size/split size of 256 MB in my use case.
Coming back to your question,
Having way too many mappers is also an overhead on the framework. Going by the use case mentioned in the question, we might not be spilling keys and values from the 100 MB circular buffer. But consider the case where the split size is 64 MB and the mapper does some calculations based on the input and emits additional results as part of the map output: there is a chance that the map output will be larger than the configured circular buffer size. Another case is 64 MB of block-compressed data, which simply expands in size when processed. Also consider mappers that fetch additional data from "Side Data Distribution" or the "distributed cache" in the map phase.
Just an additional Note:
From my experience I can clearly say that when we work on/with a framework the default configurations will never suit our requirements. We need to tweak and tune the system to give us the best possible performance.
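The cases above can be checked with simple arithmetic. The sketch below is a simplification, not Hadoop code: it compares the map output size with the spill threshold (io.sort.mb times io.sort.spill.percent, using the defaults quoted in the question) and ignores the per-record bookkeeping that also lives in the buffer. The function names and the output_ratio parameter (map output size divided by input split size) are hypothetical.

```python
def spill_threshold_mb(io_sort_mb=100, spill_percent=0.80):
    """Buffer fill level, in MB, at which the background spill starts."""
    return io_sort_mb * spill_percent


def will_spill(split_mb, output_ratio, io_sort_mb=100, spill_percent=0.80):
    """True if the map output alone exceeds the spill threshold."""
    return split_mb * output_ratio > spill_threshold_mb(io_sort_mb, spill_percent)


same_size_output = will_spill(64, 1.0)    # 64 MB of output, below 80 MB
expanded_output = will_spill(64, 1.5)     # 96 MB of output, above 80 MB
larger_split = will_spill(256, 1.0)       # 256 MB split, above 80 MB
```

So with a 64 MB split and output no larger than the input, the threshold is not reached, but a mapper that expands its input by half, or a 256 MB split, already crosses it.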

Compression to Improve Hard Disk Write Performance

On a modern system can local hard disk write speeds be improved by compressing the output stream?
This question derives from a case I'm working with where a program serially generates and dumps around 1-2GB of text logging data to a raw text file on the hard disk and I think it is IO bound. Would I expect to be able to decrease runtimes by compressing the data before it goes to disk or would the overhead of compression eat up any gain I could get? Would having an idle second core affect this?
I know this would be affected by how much CPU is being used to generate the data so rules of thumb on how much idle CPU time would be needed would be good.
I recall a video talk where someone used compression to improve read speeds for a database but IIRC compressing is a lot more CPU intensive than decompressing.
Yes, yes, yes, absolutely.
Look at it this way: take your maximum contiguous disk write speed in megabytes per second. (Go ahead and measure it, time a huge fwrite or something.) Let's say 100 MB/s. Now take your CPU speed in megahertz; let's say 3 GHz = 3000 MHz. Divide the CPU speed by the disk write speed. That's the number of cycles that the CPU is spending idle, that you can spend per byte on compression. In this case 3000/100 = 30 cycles per byte.
If you had an algorithm that could shrink your data to 80% of its size (a 1.25:1 ratio), for an effective 125 MB/s write speed, you would have 24 cycles per byte to run it in and it would basically be free because the CPU wouldn't be doing anything else anyway while waiting for the disk to churn. 24 cycles per byte = 3072 cycles per 128-byte cache line, easily achieved.
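The budget calculation is easy to redo for your own numbers. A minimal sketch, with a hypothetical function name; MHz divided by MB/s gives cycles per byte because the two factors of a million cancel:

```python
def cycles_per_byte(cpu_mhz, throughput_mb_per_s):
    """CPU cycles available per byte at the given data rate."""
    return cpu_mhz / throughput_mb_per_s


baseline = cycles_per_byte(3000, 100)     # 30.0 cycles per byte at 100 MB/s
compressed = cycles_per_byte(3000, 125)   # 24.0 cycles per byte at 125 MB/s
per_cache_line = compressed * 128         # 3072.0 cycles per 128-byte line
```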
We do this all the time when reading optical media.
If you have an idle second core it's even easier. Just hand off the log buffer to that core's thread and it can take as long as it likes to compress the data since it's not doing anything else! The only tricky bit is you want to actually have a ring of buffers so that you don't have the producer thread (the one making the log) waiting on a mutex for a buffer that the consumer thread (the one writing it to disk) is holding.
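Here is a minimal sketch of the ring-of-buffers idea in Python, under a few assumptions: the function name log_with_ring is hypothetical, zlib at level 1 stands in for whatever fast compressor you pick, and the compressed output is collected in memory where a real logger would write it to a file. Two queues pass buffers back and forth, so the producer only waits when every buffer is still in flight.

```python
import queue
import threading
import zlib


def log_with_ring(chunks, n_buffers=4, level=1):
    """Compress `chunks` on a consumer thread; return the compressed bytes."""
    free = queue.Queue()    # buffers the producer may fill
    full = queue.Queue()    # filled buffers waiting for the consumer
    for _ in range(n_buffers):
        free.put(bytearray())

    out = []
    compressor = zlib.compressobj(level)

    def consumer():
        while True:
            buf = full.get()
            if buf is None:          # sentinel: producer is done
                out.append(compressor.flush())
                return
            out.append(compressor.compress(bytes(buf)))
            buf.clear()
            free.put(buf)            # hand the buffer back to the producer

    worker = threading.Thread(target=consumer)
    worker.start()
    for chunk in chunks:
        buf = free.get()             # waits only if every buffer is in flight
        buf.extend(chunk)
        full.put(buf)
    full.put(None)
    worker.join()
    return b"".join(out)


lines = [b"2024-01-01 INFO step %d ok\n" % i for i in range(1000)]
compressed = log_with_ring(lines)
```

The number of buffers bounds how far the producer can run ahead of the compressor, which also bounds memory use.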
Yes, this has been true for at least 10 years. There are operating-systems papers about it. I think Chris Small may have worked on some of them.
For speed, gzip/zlib compression on lower quality levels is pretty fast; if that's not fast enough you can try FastLZ. A quick way to use an extra core is just to use popen(3) to send output through gzip.
For what it's worth, Sun's ZFS filesystem can enable on-the-fly compression to decrease the amount of disk I/O without a significant increase in overhead, as an example of this in practice.
The Filesystems and storage lab from Stony Brook published a rather extensive performance (and energy) evaluation on file data compression on server systems at IBM's SYSTOR systems research conference this year: paper at ACM Digital Library, presentation.
The results depend on the
used compression algorithm and settings,
the file workload and
the characteristics of your machine.
For example, in the measurements from the paper, with a textual workload in a server environment, lzop with low compression effort is faster than a plain write, but bzip and gz aren't.
In your specific setting, you should try it out and measure. It really might improve performance, but it is not always the case.
CPUs have grown faster at a faster rate than hard drive access. Even back in the 80's, many compressed files could be read off the disk and uncompressed in less time than it took to read the original (uncompressed) file. That will not have changed.
Generally though, these days the compression/de-compression is handled at a lower level than you would be writing, for example in a database I/O layer.
A second core only makes a difference if the system is also doing a significant number of other things, and your program would have to be multi-threaded to take advantage of the additional CPU.
Logging the data in binary form may be a quick improvement. You'll write less to the disk and the CPU will spend less time converting numbers to text. It may not be useful if people are going to be reading the logs, but they won't be able to read compressed logs either.
Windows already supports File Compression in NTFS, so all you have to do is to set the "Compressed" flag in the file attributes.
You can then measure if it was worth it or not.
This depends on lots of factors and I don't think there is one correct answer. It comes down to this:
Can you compress the raw data faster than the raw write performance of your disk times the compression ratio you are achieving (or the multiple in speed you are trying to get) given the CPU bandwidth you have available to dedicate to this purpose?
Given today's relatively high data write rates in the 10's of MBytes/second this is a pretty high hurdle to get over. To the point of some of the other answers, you would likely have to have easily compressible data and would just have to benchmark it with some test of reasonableness type experiments and find out.
As for a specific opinion (a guess!) on the point about additional cores: if you thread up the compression of the data and keep the core(s) fed, then with the high compression ratio of text it is likely such a technique would bear some fruit. But this is just a guess. In a single-threaded application alternating between disk writes and compression operations, it seems much less likely to me.
If it's just text, then compression could definitely help. Just choose a compression algorithm and settings that make the compression cheap. "gzip" is cheaper than "bzip2" and both have parameters that you can tweak to favor speed or compression ratio.
If you are I/O bound saving human-readable text to the hard drive, I expect compression to reduce your total runtime.
If you have an idle 2 GHz core, and a relatively fast 100 MB/s streaming hard drive, halving the net logging time requires at least 2:1 compression and no more than roughly 10 CPU cycles per uncompressed byte for the compressor to ponder the data.
With a dual-pipe processor, that's (very roughly) 20 instructions per byte.
I see that LZRW1-A (one of the fastest compression algorithms) uses 10 to 20 instructions per byte, and compresses typical English text about 2:1.
At the upper end (20 instructions per byte), you're right on the edge between IO bound and CPU bound. At the middle and lower end, you're still IO bound, so there are a few cycles available (not many) for a slightly more sophisticated compressor to ponder the data a little longer.
If you have a more typical non-top-of-the-line hard drive, or the hard drive is slower for some other reason (fragmentation, other multitasking processes using the disk, etc.), then you have even more time for a more sophisticated compressor to ponder the data.
You might consider setting up a compressed partition, saving the data to that partition (letting the device driver compress it), and comparing the speed to your original speed.
That may take less time and be less likely to introduce new bugs than changing your program and linking in a compression algorithm.
I see a list of compressed file systems based on FUSE, and I hear that NTFS also supports compressed partitions.
If this particular machine is often IO bound, another way to speed it up is to install a RAID array.
That would give a speedup to every program and every kind of data (even incompressible data).
For example, the popular RAID 1+0 configuration with 4 total disks gives a speedup of nearly 2x.
The nearly as popular RAID 5 configuration, with the same 4 total disks, gives a speedup of nearly 3x.
It is relatively straightforward to set up a RAID array with a speed 8x the speed of a single drive.
High compression ratios, on the other hand, are apparently not so straightforward. Compression of "merely" 6.30 to one would give you a cash prize for breaking the current world record for compression (Hutter Prize).
This used to be something that could improve performance in quite a few applications way back when. I'd guess that today it's less likely to pay off, but it might in your specific circumstance, particularly if the data you're logging is easily compressible.
However, as Shog9 commented:
Rules of thumb aren't going to help
you here. It's your disk, your CPU,
and your data. Set up a test case and
measure throughput and CPU load with
and without compression - see if it's
worth the tradeoff.
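A test case along those lines can be very small. The sketch below times a plain write against a zlib level 1 write of the same data; the function name time_write and the sample log line are made up for the demo, and the repeated line compresses far better than real logs would, so substitute a sample of your own output before drawing conclusions. It does not measure CPU load, only wall-clock time and bytes written.

```python
import os
import tempfile
import time
import zlib


def time_write(path, data, level=None):
    """Write `data` to `path`, optionally zlib-compressed.

    Returns (elapsed seconds, bytes written).
    """
    start = time.perf_counter()
    payload = data if level is None else zlib.compress(data, level)
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())   # force the data to disk so the timing includes I/O
    return time.perf_counter() - start, len(payload)


# Roughly 10 MB of artificial, highly repetitive log text.
sample = b"2024-01-01 12:00:00 INFO request handled in 12 ms\n" * 200_000
tmpdir = tempfile.mkdtemp()
plain_s, plain_n = time_write(os.path.join(tmpdir, "plain.log"), sample)
comp_s, comp_n = time_write(os.path.join(tmpdir, "comp.log.z"), sample, level=1)
print(f"plain: {plain_s:.3f}s {plain_n} bytes; zlib-1: {comp_s:.3f}s {comp_n} bytes")
```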

Resources