RAM Usage Keeps Going Up While Training an RL Network Using RLLib and TensorFlow

I had been using older versions of Ray and TensorFlow, but recently transitioned to the following, most up-to-date versions on an Ubuntu 20.04 Linux setup:
ray==2.0.0
tensorflow==2.10.0
cuDNN==8.1
CUDA==11.2
While training a single-agent network, I have been running into RAM utilization that keeps climbing; the ram_util_percent plot in TensorBoard shows memory usage rising steadily over the course of training. My training session keeps crashing, and this behavior was not there with earlier versions of ray and tensorflow.
Below are the things I have tried so far (a sketch of how the first two are configured follows the list):
Set the argument reuse_actors = True in ray.tune.run()
Limited object_store_memory to a certain amount, currently 0.25 GB
Following this and this, set the core file size to unlimited and increased the open-files limit
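For reference, this is roughly how those two settings are wired in (a minimal sketch; the "PPO" trainable and the config dict are placeholders, not my actual setup):

import ray
from ray import tune

# object_store_memory is given in bytes; this caps it at roughly 0.25 GB
ray.init(object_store_memory=int(0.25 * 1024 ** 3))

tune.run(
    "PPO",                       # placeholder algorithm; my actual trainable differs
    config={"framework": "tf"},  # placeholder config
    reuse_actors=True,           # reuse actors between trials instead of recreating them
)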
None of these methods have helped so far. As a temporary workaround, I am calling Python's garbage collector to free up unused memory when memory usage reaches 80%. I am not sure this method will continue to mitigate the issue as I train for more time steps; my guess is no.
import gc
import psutil

def collectRemoveMemoryGarbage(self, percThre=80.0):
    """
    Force a garbage-collection pass once system-wide RAM usage crosses a threshold.

    :param percThre: percentage threshold (float); GC runs at or above this value
    :return: None
    """
    # psutil reports overall system RAM usage as a percentage
    if psutil.virtual_memory().percent >= percThre:
        _ = gc.collect()
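For context, one way such a check could be wired into the run is through Tune's Callback hook (a sketch only; the MemoryGuard class below is for illustration and is not part of my actual code):

import gc
import psutil
from ray import tune

class MemoryGuard(tune.Callback):
    """Run a garbage-collection pass when system RAM usage crosses a threshold."""

    def __init__(self, perc_thre=80.0):
        self.perc_thre = perc_thre

    def on_trial_result(self, iteration, trials, trial, result, **info):
        # Checked after every reported training result
        if psutil.virtual_memory().percent >= self.perc_thre:
            gc.collect()

# e.g. tune.run(..., callbacks=[MemoryGuard()])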
Does anyone know a better approach to this problem? I know this is a widely discussed problem on the ray GitHub issues page. This might end up being an annoying bug in ray or tensorflow, but I am looking for feedback from others who are well-versed in this area.

Increasing the checkpoint_freq argument within ray.tune.run() helped me achieve 5e6 time steps without any crash due to running out of memory; previously, it was 10, now it's 50.
It seems that checkpointing less frequently does the trick.
I will try a higher number of time steps.
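For anyone hitting the same thing, the only change is the checkpoint_freq argument (a sketch; the trainable and config are placeholders for whatever the run already uses):

tune.run(
    "PPO",                # placeholder trainable
    config=config,        # unchanged from the original run
    reuse_actors=True,
    checkpoint_freq=50,   # was 10; checkpoint every 50 training iterations instead
)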

Related

Maximize Tensorflow Performance

I'm using Tensorflow 1.2 for image segmentation on an AWS p2 instance (Tesla K80). Is there an easy way for me to find out if I can improve the performance of my code?
Here is what I know:
I measured the execution time of the various parts of my program, and 99% of the time is spent calling session run:
sess.run([train_op, loss, labels_modified, output_modified],
         feed_dict=feed_dict)
where feed_dict is a mapping from placeholders to tensors.
The session.run method only takes 0.43 seconds to execute for the following parameters: batch_size=1, image_height=512, image_width=512, channels=3.
The network has 14 convolutional layers (no dense layers) with a total of 11 million trainable parameters.
Because I'm doing segmentation I use a batch size of 1 and then compute the pixel-wise loss (512*512 cross entropy losses).
I tried to compile Tensorflow from source and got zero performance improvements.
I read through the performance guide https://www.tensorflow.org/performance/performance_guide but I don't want to spend a lot of time trying all of these suggestions. It already took me 8 hours to compile Tensorflow and it gave me zero benefits!
How can I find out which parts of the session run take most of the time? I have a feeling that it might be the loss calculation.
And is there any clear study that shows how much speedup I can expect from the things mentioned in the performance guide?
You're performing a computationally intensive task that requires a lot of calculation and a lot of memory. Your model has a lot of parameters, and each one must be computed in the forward pass, differentiated in the backward pass, and then updated.
The suggestions in the page you linked are OK, and if you have followed them all there's nothing else you can do, except creating one or more additional instances and running the training in parallel. This will give you an Nx speed-up (where N is the number of instances that compute the gradients for your input batch), but it's extremely expensive and not always applicable (moreover, it requires changing your code so that it follows the client-server architecture for gradient computation and weight updates).
Based on your small piece of code, I see you're using a feed dictionary. Generally it's best to avoid using feed dictionaries if queues can be used (see https://github.com/tensorflow/tensorflow/issues/2919). The Tensorflow documentation covers the use of queues here. Switching to queues will definitely improve your performance.
Maybe you can run your code with tfprof to do some profiling to find out where the bottleneck is.
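For example, a minimal sketch of tracing a single sess.run with RunOptions/RunMetadata and dumping a chrome://tracing timeline (an alternative to tfprof; the fetches and feed_dict are the ones from the question):

import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

# Run one training step with tracing enabled
sess.run([train_op, loss, labels_modified, output_modified],
         feed_dict=feed_dict,
         options=run_options,
         run_metadata=run_metadata)

# Write a chrome://tracing compatible timeline and inspect which ops dominate
tl = timeline.Timeline(run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(tl.generate_chrome_trace_format())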
Just a guess, but the performance problem may be caused by feeding data. I don't know how you prepare your feed_dict, but if you have to read your data from disk to build the feed_dict for every sess.run, the program is slowed down because reading data and training happen synchronously. You can try converting your data to TFRecords and making data loading and training asynchronous by using tf.FIFOQueue.
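A rough sketch of what such a TFRecord + queue input pipeline looks like with the TF 1.x API (the file name, feature names, dtypes, and shapes below are assumptions for a 512x512 RGB segmentation setup, not taken from the question):

import tensorflow as tf

# Background reader threads pull examples from the TFRecord file
filename_queue = tf.train.string_input_producer(["train.tfrecords"])
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)

features = tf.parse_single_example(serialized, features={
    "image": tf.FixedLenFeature([], tf.string),  # assumed feature name
    "label": tf.FixedLenFeature([], tf.string),  # assumed feature name
})
image = tf.reshape(tf.decode_raw(features["image"], tf.uint8), [512, 512, 3])
label = tf.reshape(tf.decode_raw(features["label"], tf.uint8), [512, 512])

# Batches are assembled asynchronously while the training op runs
image_batch, label_batch = tf.train.shuffle_batch(
    [image, label], batch_size=1, capacity=32, min_after_dequeue=8)

The model then consumes image_batch and label_batch directly instead of placeholders, and the training loop calls tf.train.start_queue_runners(sess=sess) once before iterating.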

TensorFlow - Low GPU usage on Titan X

For a while, I have been noticing that TensorFlow (v0.8) does not seem to fully use the computation power of my Titan X. For several CNNs that I have been running the GPU usage does not seem to exceed ~30%. Typically the GPU utilization is even lower, more like 15%. One particular example of a CNN that shows this behavior is the CNN from DeepMind's Atari paper with Q-learning (see link below for code).
When I see other people in our lab running CNNs written in Theano or Torch, the GPU usage is typically 80%+. This makes me wonder why the CNNs that I write in TensorFlow are so 'slow', and what I can do to make more efficient use of the GPU processing power. Generally, I am interested in ways to profile the GPU operations and discover where the bottlenecks are. Any recommendations on how to do this are very welcome, since this does not really seem possible with TensorFlow at the moment.
Things I did to find out more about the cause of this problem:
Analyzing TensorFlow's device placement, everything seems to be on /gpu:0, so that looks OK.
Using cProfile, I have optimized the batch generation and other preprocessing steps. The preprocessing is performed on a single thread, but the actual optimization steps performed by TensorFlow take much longer (see average runtimes below). One obvious idea to increase the speed is to use TF's queue runners, but since the batch preparation is already 20x faster than the optimization, I wonder whether this is going to make a big difference.
Avg. Time Batch Preparation: 0.001 seconds
Avg. Time Train Operation: 0.021 seconds
Avg. Time Total per Batch: 0.022 seconds (45.18 batches/second)
Run on multiple machines to rule out hardware issues.
Upgraded to the latest versions of CuDNN v5 (RC), CUDA Toolkit 7.5 and reinstalled TensorFlow from sources about a week ago.
An example of the Q-learning CNN for which this 'problem' occurs can be found here: https://github.com/tomrunia/DeepReinforcementLearning-Atari/blob/master/qnetwork.py
Example of nvidia-smi output displaying the low GPU utilization.
With the more recent versions of Tensorflow (I am using Tensorflow 1.4), we can obtain runtime statistics and visualize them in Tensorboard.
These statistics include compute time and memory usage for each node in the computation graph.
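A minimal sketch of how those runtime statistics are collected and attached to a TensorBoard summary writer (the session, fetches, summary writer, and step counter are placeholders for whatever the training loop already defines):

import tensorflow as tf

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

summary, _ = sess.run([merged_summaries, train_op],
                      options=run_options,
                      run_metadata=run_metadata)

# Per-node compute time and memory usage show up on this step in TensorBoard
summary_writer.add_run_metadata(run_metadata, "step%d" % step)
summary_writer.add_summary(summary, step)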

Tracking down leaks in Ruby daemon

We have a Ruby daemon running on Ruby 2.2.0 & Ubuntu 12.04 that recently started leaking a lot of memory. It grows to about 10GB of usage over 12 hours and then the growth seems to slow down.
We're not quite sure if it's due to a change we made or just that we're giving it more work to process.
We've tried the technique described in this blog post to track down the cause, but the results show that it only found a few megabytes of new objects (and only 150 MB of total objects) created when the process had actually grown by a gigabyte.
We've also tried doing a core dump of the process and looking through that, but there don't seem to be any obvious clues there either (Mostly blank space with occasional random-looking binary data).
What are some strategies we can use to track down the issue?

Is there any way to force JavaFX to release video memory?

I'm writing an application leveraging JavaFX that scrolls a large amount of image content on and off of the screen every 20-30 seconds. It's meant to be able to run for multiple hours, pulling in completely new content and discarding old content every couple of minutes. I have 512 MB of graphics memory on my system, and after several minutes all of that memory has been consumed by JavaFX; no matter what I do with my JavaFX scene, none of it is released. I've been very careful to discard nodes when they drop off of the scene, and at most I have 50-60 image nodes in memory at one time. I really need to be able to do a hard release of the graphics memory that was backing these images, but haven't been able to figure out how to accomplish that, as the Image interface in JavaFX seems to be very high level. JavaFX will continue to run fine, but other graphics-heavy applications will fail to load due to limited resources.
I'm looking for something like the flush() method on java.awt.image.Image:
http://docs.oracle.com/javase/7/docs/api/java/awt/Image.html#flush()
I'm running java 7u13 on Linux.
EDIT:
I managed to work out a potential workaround ( see below ), but have also entered a JavaFX JIRA ticket to request the functionality described above:
RT-28661
Add explicit access to a native resource cleanup function on nodes.
The best workaround that I could come up with was to set my JVM's max heap to half of the available limit of my graphics card (I have 512 MB of graphics memory, so I set this to -Xmx256m). This forces the GC to be more proactive in cleaning up my discarded javafx.scene.image.Image objects, which in turn seems to trigger graphics memory cleanup on the part of JavaFX.
Previously my heap space was set to 512 MB (I have 4 GB of system memory, so this is a very manageable limit). The problem with that seems to be that the JVM was being very lazy about cleaning up my images until it started approaching this 512 MB limit. Since all of my image data was copied into graphics memory, this meant I had most likely exhausted my graphics memory before the JVM really started caring about cleanup.
I did try some of the suggestions by jewelsea:
I am calling setCache(false), so this may be having a positive effect, but I didn't notice an improvement until I dropped my max heap size.
I tried running with Java 8 with some positive results. It did seem to behave better in graphics memory management, but it still ate up all of my memory and didn't seem to start caring about graphics memory until I was almost out. If reducing your application's heap limit is not feasible, then evaluating the Java 8 pre-release may be worthwhile.
I will be posting some feature requests to the JavaFX project and will provide links to the JIRA tickets.
Perhaps you are encountering behaviour related to the root cause of the following issue:
RT-16011 Need mechanism for PG nodes to know when they are no longer part of a scene graph
From the issue description:
Some PG nodes contain handles to non-heap resources, such as GPU textures, which we would want to aggressively reclaim when the node is no longer part of a scene graph. Unfortunately, there is no mechanism to report this state change to them so that they can release their resources so we must rely on a combination of GC, Ref queues, and sometimes finalization to reclaim the resources. Lazy reclamation of some of these resources can result in exceptions when garbage collection gets behind and we run out of these limited resources.
There are numerous other related issues you can see when you look at the issue page I linked (signup is required to view the issue, but anybody can signup).
A sample related issue is:
RT-15516 image data associated with cached nodes that are removed from a scene are not aggressively released
On which a user commented:
I found a workaround for my app by just setting the use of cache to false for all frequently used nodes. 2 days working without any crashes.
So try calling setCache(false) on your nodes.
Also try using a Java 8 preview release where some of these issues have been fixed and see if it increases the stability of your application. Though currently, even in the Java 8 branch, there are still open issues such as the following:
RT-25323 Need a unified Texture resource management system for Prism
Currently texture resources are managed separately in at least 2 places depending on how it is used; one is a texture cache for images and the other is the ImagePool for RTTs. This approach is flawed in its design, i.e. the 2 caches are unaware of each other and it assumes system has unlimited native resources.
Using a video card with more memory may either reduce or eliminate the issue.
You may also wish to put together a minimal executable example which demonstrates your issue and raise a bug request against the JavaFX Runtime project so that a JavaFX developer can investigate your scenario and see if it is new or a duplicate of a known issue.

Drastic performance improvement in .NET CF after app gets moved out of the foreground. Why?

I have a large (500K lines) .NET CF (C#) program, running on CE6/.NET CF 3.5 (v.3.5.10181.0). This is running on a Freescale i.MX31 (ARM) @ 400MHz. It has 128MB RAM, with ~80MB available to applications. My app is the only significant one running (this is a dedicated, embedded system). Managed memory in use (as reported by GC.Collect) is about 18MB.
To give a better idea of the app size, here are some stats culled from .NET CF Remote Performance Monitor after starting up the application:
GC:
Garbage Collections 131
Bytes Collected by GC 97,919,260
Managed Bytes in use after GC 17,774,992
Total Bytes in use after GC 24,117,424
GC Compactions 41
JIT:
Native Bytes Jitted: 10,274,820
Loader:
Classes Loaded 7,393
Methods Loaded 27,691
Recently, I have been trying to track down a performance problem. I found that my benchmark after running the app in two different startup configurations would run in approximately 2 seconds (slow case) vs. 1 second (fast case). In the slow case, the time for the benchmark could change randomly from EXE run to EXE run from 1.1 to 2 seconds, but for any given EXE run, would not change for the life of the application. In other words, you could re-run the benchmark and the time for the test stays the same until you restart the EXE, at which point a new time is established and consistent.
I could not explain the 1.1 to 2x slowdown via any conventional mechanism, or by narrowing the slowdown to any particular part of the benchmark code. It appeared that the overall process was just running slower, almost like a thread was spinning and taking away some of "my" CPU.
Then, I randomly discovered that just by switching away from my app (the GUI loses the foreground) to another app, my performance issue disappears. It stays gone even after returning my app to the foreground. I now have a tentative workaround where my app after startup launches an auxiliary app with a 1x1 size window that kills itself after 5ms. Thus the aux app takes the foreground, then relinquishes it.
The question is, why does this speed up my application?
I know that code gets pitched when a .NET CF app loses the foreground. I also notice that when performing a "GC Heap" capture with .NET CF Remote Performance Monitor, a Code Pitch is logged -- and this also triggers the performance improvement in my app. So I suspect that code pitching is somehow related to, or even responsible for, fixing performance. But I'm at a loss as to how to determine whether that is really the case, or how to explain why pitching code could help in this way. Does pitching out lots of code somehow improve locality of reference of code pages (which are re-JITted, presumably near each other in memory) enough to help this much? (My benchmark spans probably 3 dozen routines and hundreds of lines of code.)
Most importantly, what can I do in my app to reliably avoid this slower condition. Any pointers to relevant .NET CF / JIT / Code pitching information would be greatly appreciated.
Your app going to the background auto-triggers a GC.Collect, which collects, may compact the GC Heap and may pitch code. Have you checked to see if a manual GC.Collect without going to the background gives the same behavior? It might not be pitching that's giving the perf gain, it might be collection or compaction. If a significant number of dead roots are swept up, walking the root tree may be getting faster. Can't say I've specifically seen this issue, so this is all conjecture.
Send your app a WM_HIBERNATE message before your performance loop. It will clean things up.
We have a similar issue with our .NET CF application.
Over time, our application progressively slows down, eventually grinding to a halt, which I suspect is due to high CPU load or, as #wil-s says, as if a thread is spinning and consuming CPU. The only assumption / conclusion I've reached so far is that either we have a rogue thread in our code, or there's an under-the-covers issue in .NET CF, maybe with the JITter.
Closing the application and re-launching returns our application to normal expected performance.
I have yet to implement a code change to test issuing WM_HIBERNATE or launching a dummy app that quits itself (as above) to force a code pitch, but I am fairly sure this will resolve our issue based on the above comments (so many thanks for that).
However, I'm subsequently interested to know whether a root cause was ever found to this specific issue?
Incidentally and seemingly off topic (but bear with me), we're using a Freescale i.MX28 processor and are experiencing unpredictable FlashDisk corruption. Seeing 2K blocks of 0xFFs (erased blocks) in random files located on NAND Flash.
I'm mentioning this as I now believe the CPU and FlashDisk corruption issues are linked, due to this article as well as this one:
https://electronics.stackexchange.com/questions/26720/flash-memory-corruption-due-to-electricals
In the article, #jwygralak67 comments:
I recently worked through a flash corruption issue, on a WinCE system,
as part of a development team. We would sporadically find 2K blocks of
flash that were erased. (All bytes 0xFF) For about 6 months we tested
everything from ESD, to dirty power to EMI and RFI interference, we
bought brand new devices and tracked usage to make sure we weren't
exceeding the erase cycle limit and burning out blocks, we went through
our (application level) software with a fine toothed comb.
In the end it turned out to be an obscure bug in the very low level
flash driver code, which only occurred under periods of heavy CPU
load. The driver came from a 3rd party. We informed them of the issue
we found, but I don't know if they ever released a patch.
Unfortunately, we're yet to make contact with him.
With all of this in mind, potentially if we work around the high CPU load, maybe the corruption will no longer be present. Another conjecture case!
On that assumption however, this doesn't give a firm root cause for either situation, which I'm desperately seeking!
Any knowledge or insight, however small, would be very gratefully received.
#ctacke - we've spoken before regarding OpenNETCF via email, so I'm pleased to see your name!
