Writing a large file prevents large block DMA allocation - memory-management

I'm working with a board with an ARM-based processor running Linux (3.0.35). The board has 1 GB of RAM and is connected to a fast SSD and to a 5 MP camera.
My goal is to capture high-resolution images and write them directly to disk.
All goes well until I try to save a very long video (over 1 GB of data).
After saving such a large file, it seems that I'm unable to reload the camera driver - it fails to allocate a large enough DMA memory block for streaming (the call to dma_alloc_coherent() fails).
I narrowed it down to this scenario: Linux boots (with most of the memory available), I write random data into a large file (>1 GB), and when I then try to load the camera driver it fails.
To my question -
When I open a file for writing, write a large amount of data, and close the file, isn't the memory which was used for writing the data to HD supposed to be freed?
I can understand why the memory becomes fragmented during the HD access, but once the transactions to the HD are completed - why is the memory still so fragmented that I cannot allocate 15 MB of contiguous RAM?
Thanks

[...] close the file, isn't the memory which was used for writing the data to HD supposed to be freed?
No, it will be cached; you can check /proc/meminfo to see this. Whether dma_alloc_coherent() uses only free memory is a good question.
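As a rough illustration (this only shows the symptom and releases clean caches, it does not defragment anything): /proc/buddyinfo lists, per zone, how many free blocks of each order are left, and writing to /proc/sys/vm/drop_caches frees the clean page cache; /proc/sys/vm/compact_memory is only present when the kernel is built with CONFIG_COMPACTION. Something like the sketch below, run as root before reloading the driver, shows whether a large contiguous block is even available:
/* sketch: inspect free-page fragmentation and drop the clean page cache
 * before reloading the camera driver; run as root */
#include <stdio.h>

static void dump(const char *path)
{
    char line[256];
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return; }
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);            /* one row per zone: free blocks per order */
    fclose(f);
}

static void poke(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return; }
    fputs(val, f);                      /* the kernel acts on the write at once */
    fclose(f);
}

int main(void)
{
    dump("/proc/buddyinfo");                      /* counts of free 2^order page blocks */
    poke("/proc/sys/vm/drop_caches", "3\n");      /* release clean page cache and slab */
    poke("/proc/sys/vm/compact_memory", "1\n");   /* only if CONFIG_COMPACTION=y */
    dump("/proc/buddyinfo");                      /* compare before/after */
    return 0;
}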

Related

Memory copy is taking more time on GPU compared to CPU

I have source and destination pointers for the image I want to copy. When I run the copy on the CPU, it takes 2 ms.
Now, I ran the code in OpenCL with:
clCreateBuffer(context,CL_MEM_USE_HOST_PTR|CL_MEM_READ_WRITE,size,src_ptr,errcode_ret)
clCreateBuffer(context,CL_MEM_USE_HOST_PTR|CL_MEM_READ_WRITE,size,dst_ptr,errcode_ret)
and wrote a kernel with a global work size of (W, H), so each work item copies one pixel. That takes about 20 ms.
Can someone please help me do the memory copy efficiently in OpenCL when I have image pointers into global memory? What is the proper work-group size to use for this?
Can you help clarify what you're trying to accomplish? Are you trying to compare the time it takes to memcpy a host buffer to the time it takes to copy a device buffer using a GPU kernel?
If so, try allocating the buffer without the CL_MEM_USE_HOST_PTR flag. From the first response here it seems like some implementations map that buffer to system memory instead of device memory, which could slow down the copy kernel.
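If you go that route, something along these lines might be closer to what you want (just a sketch: it assumes a valid context and command queue plus host pointers src_ptr/dst_ptr of size bytes, and omits all error handling) - the buffers live in device memory, the upload happens once, and there is no copy kernel or work-group size to tune:
/* assumes #include <CL/cl.h>, a valid cl_context 'context' and
 * cl_command_queue 'queue', and host buffers src_ptr/dst_ptr of 'size' bytes */
cl_int err;
cl_mem d_src = clCreateBuffer(context, CL_MEM_READ_ONLY,  size, NULL, &err);
cl_mem d_dst = clCreateBuffer(context, CL_MEM_WRITE_ONLY, size, NULL, &err);

/* one explicit host->device transfer instead of letting the runtime
 * shadow a host pointer behind your back */
clEnqueueWriteBuffer(queue, d_src, CL_TRUE, 0, size, src_ptr, 0, NULL, NULL);

/* device-to-device copy: no kernel involved */
clEnqueueCopyBuffer(queue, d_src, d_dst, 0, 0, size, 0, NULL, NULL);

/* read the result back only if the host actually needs it */
clEnqueueReadBuffer(queue, d_dst, CL_TRUE, 0, size, dst_ptr, 0, NULL, NULL);

clReleaseMemObject(d_src);
clReleaseMemObject(d_dst);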
how to efficiently do memory copy on open cl when we have image pointers to global memory
The efficient way is to use memcpy() on the host pointers. IOW use the CPU.
when we use CL_MEM_USE_HOST_PTR, GPU can access the image directly from global memory instead of copying from global memory
That's not strictly true. It's true for integrated GPUs (if the host_ptr memory pointer is properly aligned). Discrete GPUs will still copy host memory to their own memory over the PCI express bus. If you read the documentation for clCreateBuffer, it says:
CL_MEM_USE_HOST_PTR ... OpenCL implementations are allowed to cache the buffer contents pointed to by host_ptr in device memory. This cached copy can be used when kernels are executed on a device.
Discrete GPUs cannot directly "work" on host memory. Even if they could, it would be so slow as to be pointless.
In fact using CL_MEM_USE_HOST_PTR with a discrete GPU may result in worse performance, because the GPU will have to keep the host copy in sync with its own copy, which will result in a lot of PCIe transfers. CL_MEM_USE_HOST_PTR only makes sense with integrated GPUs to save unnecessary transfers and memory copies.
Generally the way you work with GPUs is to minimize memory transfers, so you create buffers once (with clCreateBuffer), then launch the kernels you need on them, and then either transfer the result back to the host (via enqueueReadImage) or display it with OpenGL interop. You'll have to clarify what you're doing if you want more useful advice.

Will Windows still be able to allocate memory when free physical memory is very low?

The application is being developed with Visual Studio on a 32-bit Windows system:
Let's say lots of other applications are running on my machine and have occupied almost all of the physical memory, leaving only 1 MB free. If my application (which has not yet allocated any memory) tries to allocate, say, 2 MB, will the call be successful?
My guess: in theory, each Windows application has 2 GB of virtual memory available.
So I believe this call should be successful (regardless of how much physical memory is available). But I am not sure about this, which is why I'm asking here.
Windows gives a rock-hard guarantee that this will always work. A process can only allocate virtual memory when Windows can commit space in the paging file for the allocation. If necessary, it will grow the paging file to make the space available. If that fails, for example when the paging file grows beyond the preset limit, then the allocation fails as well. Windows doesn't have the equivalent of the Linux "OOM killer", it does not support over-committing that may require an operating system to start randomly killing processes to find RAM.
Do note that the "always works" clause does have a sting. There is no guarantee on how long this will take. In very extreme circumstances the machine can start thrashing where just about every memory access in the running processes causes a page fault. Code execution slows down to a crawl, you can lose control with the mouse pointer frozen when Explorer or the mouse or video driver start thrashing as well. You are well past the point of shopping for RAM when that happens. Windows applies quotas to processes to prevent them from hogging the machine, but if you have enough processes running then that doesn't necessarily avoid the problem.
Of course. It would be lousy design if memory had to be wasted now in order to be used later. Operating systems constantly re-purpose memory to its most advantageous use at any moment. They don't have to waste memory by keeping it free just so that it can be used later.
This is one of the benefits of virtual memory with a page file. Because the memory is virtual, the system can allocate more virtual memory than there is physical memory. Virtual memory that cannot fit in physical memory is pushed out to the page file.
So the fact that your system may be using all of the physical memory does not mean that your program will not be able to allocate memory. In the scenario that you describe, your 2MB memory allocation will succeed. If you then access that memory, the virtual memory will be paged in to physical memory and very likely some other pages (maybe in your process, maybe in another process) will be pushed out to the page file.
Well, it will succeed as long as there's some memory for it - apart from physical memory, there's also the page file.
However, once you reach the limit of both RAM and the page file, you're done for and that's when the out of memory situation really starts being fun.
Now, systems like Windows Vista will try to use all of your available RAM, pretty much for caching. That's a good thing, and when there's a request for memory from an application, the cache will be thrown away as needed.
As for virtual memory, you can request much more than you have available, regardless of your RAM or page file size. Only when you commit the memory does it actually need some backing - either RAM or the page file. On 64-bit, you can easily request terabytes of virtual memory - that doesn't mean you'll get it when you try to commit it, though :P
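To make the reserve/commit distinction concrete, here is a minimal Win32 sketch (the 64 MB size is an arbitrary example): reserving only claims address space, and it's the commit that has to be backed by RAM or the page file.
/* sketch: reserving address space costs nothing physical;
 * committing is what needs backing in RAM or the page file */
#include <windows.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    SIZE_T size = 64 * 1024 * 1024;   /* 64 MB */

    /* reserve: only virtual address space, no RAM and no page-file charge */
    void *p = VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);
    if (!p) { printf("reserve failed\n"); return 1; }

    /* commit: now the system must be able to back it with RAM or the page file */
    if (!VirtualAlloc(p, size, MEM_COMMIT, PAGE_READWRITE)) {
        printf("commit failed: %lu\n", GetLastError());
        return 1;
    }

    memset(p, 0, size);               /* touching the pages brings them into RAM */
    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}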
If your application is unable to find a free physical memory (RAM) block to store information, the operating system takes over and 'pages out' sections of RAM to disk to free up physical memory so that your program can perform the allocation. This is done automatically and is completely invisible to your applications.
So, in your example, on a system that has 1MB RAM free, if your application tries to allocate memory, the operating system will page certain contents of physical memory to disk and free up RAM for your application. Your application will not crash in this case.
This is, obviously, much more complicated than that.
There are several ways to configure a page file on Windows (fixed size, variable size, and on which disk). If you run out of physical memory and out of hard drive space (because your page file has grown very large due to excessive paging), or if you reach the limit of your paging file (if it has a static limit), then your applications will fail with an out-of-memory error. With today's systems and their large local storage, however, this is a rare event.
Be sure to read about paging for the full picture. Check out:
http://en.wikipedia.org/wiki/Paging
In certain cases, you will notice that you have sufficient free physical memory, say 100 MB, and yet your program fails to allocate a 10 MB block to store a large object. This is caused by fragmentation: although 100 MB is free in total, there is no single contiguous 10 MB region of the process's address space available to store your object. This results in an exception that needs to be handled in your code. If you allocate large objects in your code, you may want to split the allocation into smaller blocks to make it easier to satisfy, and then aggregate them back in your code logic. For example, instead of a single 10 MB vector, you can declare 10 x 1 MB vectors in an array and allocate memory for each one individually.
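A minimal C sketch of that last idea (the 10 x 1 MB split; the sizes and names here are made up for illustration):
/* sketch: side-step the need for one large contiguous block by
 * allocating it as several smaller chunks */
#include <stdlib.h>

#define CHUNKS      10
#define CHUNK_SIZE  (1024 * 1024)      /* 1 MB per chunk */

int main(void)
{
    unsigned char *chunk[CHUNKS] = {0};

    for (int i = 0; i < CHUNKS; i++) {
        chunk[i] = malloc(CHUNK_SIZE);
        if (!chunk[i]) {               /* far less likely to fail than one 10 MB request */
            for (int j = 0; j < i; j++) free(chunk[j]);
            return 1;
        }
    }

    /* map a logical offset to (chunk, offset within chunk) */
    size_t offset = 3u * CHUNK_SIZE + 12345;
    chunk[offset / CHUNK_SIZE][offset % CHUNK_SIZE] = 0xFF;

    for (int i = 0; i < CHUNKS; i++) free(chunk[i]);
    return 0;
}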

Imread & Imwrite do not achieve expected gains on a Ramdisk

I have written a particular image processing algorithm that makes heavy use of imwrite and imread. The following example will run simultaneously on eight Matlab sessions on a hyper-threading-enabled 6-core i7 machine. (Filenames are different for each session.)
tic;
for i=1:1000
%a processing operation will be put here%
imwrite(imgarray,temp,'Quality',100);
imgarray=imread(temp);
end
toc;
I'm considering the change temp=[ramdrive_loc temp]; in the example code for two purposes:
Reducing time consumption
Lowering hard drive wear
The image files created are about 1 MB in size. The hard drives are set up as RAID 0 with 2 x 7.2k Caviar Blacks. The machine is a Windows machine, and the partitions are formatted as NTFS.
The outputs of toc from above are (without processing images) :
Without Ramdisk: 104.330466 seconds.
With Ramdisk: 106.100880 seconds.
Is there anything that would prevent me from gaining any speed? Would changing the file system of the ramdisk to FAT32 help?
Note: There were other questions regarding ramdisk vs. hard disk comparisons; however, this question is mostly about imread, imwrite, and Matlab I/O.
Addition: The RAM disk is set up using free software from SoftPerfect. It has 3 GB of space, which is more than adequate for the task (at most 10 MB is generated and written over and over during the Matlab sessions).
File caching. Probably, Windows' file cache is already speeding up your I/O activity here, so the RAM disk isn't giving you an additional speedup. When you write out the file, it's written to the file cache and then asynchronously flushed to the disk, so your Matlab code doesn't have to wait for the physical disk writes to complete. And when you immediately read the same file back in to memory, there's a high chance it's still present in the file cache, so it's served from memory instead of incurring a physical disk read.
If that's your actual code, you're re-writing the same file over and over again, which means all the activity may be happening inside the disk cache, so you're not hitting a bottleneck with the underlying storage mechanism.
Rewrite your test code so it looks more like your actual workload: writing to different files on each pass if that's what you'll be doing in practice, including the image processing code, and actually running multiple processes in parallel. Put it in the Matlab profiler, or add finer-grained tic/toc calls, to see how much time you're actually spending in I/O (e.g. imread and imwrite, and the parts of them that are doing file I/O). If you're doing nontrivial processing outside the I/O, you might not see significant, if any, speedup from the RAM disk because the file cache would have time to do the actual physical I/O during your other processing.
And since you say there's a maximum of 10 MB that gets written over and over again, that's small enough that it could easily fit inside the file cache in the first place, and your actual physical I/O throughput is pretty small: if you write a file, and then overwrite its contents with new data before the file cache flushes it to disk, the OS never has to flush that first set of data all the way to disk. Your I/O might already be mostly happening in memory due to the cache so switching to a RAM disk won't help because physical I/O isn't a bottleneck.
Modern operating systems do a lot of caching because they know scenarios like this happen. A RAM disk isn't necessarily going to be a big speedup. There's nothing specific to Matlab or imread/imwrite about this behavior; the other RAM disk questions like RAMdisk slower than disk? are still relevant.
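To see that cache effect outside of Matlab, a rough Win32 sketch (the file names and 1 MB size are arbitrary) compares a normal buffered write, which typically returns as soon as the data is in the file cache, with a write-through one that has to wait for the disk:
/* sketch: buffered write (lands in the file cache) vs. write-through
 * (waits for the physical disk) */
#include <windows.h>
#include <stdio.h>

static void write_once(const char *name, DWORD flags)
{
    static char buf[1024 * 1024];             /* ~1 MB, like the image files */
    LARGE_INTEGER t0, t1, freq;
    DWORD written;
    HANDLE h = CreateFileA(name, GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, flags, NULL);
    if (h == INVALID_HANDLE_VALUE) return;

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    WriteFile(h, buf, sizeof(buf), &written, NULL);
    QueryPerformanceCounter(&t1);
    CloseHandle(h);
    printf("%s: %.3f ms\n", name,
           1000.0 * (t1.QuadPart - t0.QuadPart) / freq.QuadPart);
}

int main(void)
{
    write_once("cached.tmp",  FILE_ATTRIBUTE_NORMAL);    /* returns once it's in the cache */
    write_once("through.tmp", FILE_FLAG_WRITE_THROUGH);  /* waits for the physical write */
    return 0;
}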

File reading and caching

When a file is read in Java (or any language), is the data copied from disk to memory outside of the application-level buffer? For example, how many copies of the data are made when I do the following:
FileInputStream fileReader = new FileInputStream(new File("/path/to/file"));
byte[] buffer = new byte[4096];
fileReader.read(buffer);
Other than the copy of the data written from disk to the buffer, is the data also cached by the operating system or file system?
Short Answer
Maybe
Long Answer
How many copies of any particular piece of data are created when reading from a disk or disk-like device depends on the operating system and the filesystem chosen. All modern desktop operating systems maintain a read/write cache that sits between the application level and the physical device level. Mobile and embedded devices usually don't have this layer because they are writing to a memory-based device rather than a physical spinning disk.
I think that as SSDs get bigger and cheaper, this level of caching on desktop systems will get much smaller, or go away completely, since SSDs don't have the same speed issues as spinning disks. They are still slower than main memory, but they should not require the aggressive caching that is done to compensate for the slow access speed of spinning media.
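For a sense of where the copies are, here is roughly what the Java read above boils down to at the system-call level (a C sketch; the path is the same placeholder as in the question): the kernel first reads the disk block into its page cache, and read() then copies from the page cache into the application's buffer, so the data normally exists in at least those two places.
/* sketch: a plain read() goes disk -> kernel page cache -> user buffer */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char buffer[4096];
    int fd = open("/path/to/file", O_RDONLY);   /* same placeholder path as the Java example */
    if (fd < 0)
        return 1;

    read(fd, buffer, sizeof(buffer));   /* copy #2: page cache -> buffer; copy #1 was
                                         * disk -> page cache, done by the kernel */

    /* O_DIRECT (where supported) bypasses the page cache, trading the extra
     * copy for alignment restrictions and the loss of read-ahead */
    close(fd);
    return 0;
}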

RAMdisk slower than disk?

A Python program I created is I/O bound. The majority of the time (over 90%) is spent in a single loop which repeats ~10,000 times. In this loop, ~100 KB of data is generated and written to a temporary file; it is then read back by another program, and statistics about that data are collected. This is the only way to pass data into the second program.
Since this is the main bottleneck, I thought that moving the location of the temporary file from my main HDD to a (~40 MB) RAMdisk (carved out of over 2 GB of free RAM) would greatly increase the I/O speed for this file and so reduce the run-time. However, I obtained the following results (each averaged over 20 runs):
Test data 1: Without RAMdisk - 72.7s, With RAMdisk - 78.6s
Test data 2: Without RAMdisk - 223.0s, With RAMdisk - 235.1s
It would appear that the RAMdisk is slower than my HDD.
What could be causing this?
Are there any alternatives to using a RAMdisk to get faster file I/O?
Your operating system is almost certainly buffering/caching disk writes already. It's not surprising the RAM disk is so close in performance.
Without knowing exactly what you're writing or how, we can only offer general suggestions. Some ideas:
If you have 2 GB RAM you probably have a decent processor, so you could write this data to a filesystem that has compression. That would trade I/O operations for CPU time, assuming your data is amenable to that.
If you're doing many small writes, combine them to write larger pieces at once. (Can we see the source code?)
Are you removing the 100 KB file after use? If you don't need it, then delete it. Otherwise the OS may be forced to flush it to disk.
Can you write the data out in batches rather than one item at a time? Are you caching resources like open file handles, or cleaning those up each time? Are your disk writes blocking? Could you use background threads to saturate I/O without affecting compute performance?
I would look at optimising the disk writes first, and then look at faster disks when that is complete.
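As a generic sketch of the 'combine small writes' suggestion (in C rather than Python, and independent of the exact pipeline in the question; the record and batch sizes are made up): accumulate records in memory and flush them in large pieces, so the OS sees a few big writes instead of thousands of tiny ones.
/* sketch: buffer many small records and write them out in large pieces */
#include <stdio.h>
#include <string.h>

#define RECORD  100                  /* bytes per record      */
#define BATCH   (1024 * 1024)        /* flush in ~1 MB pieces */

int main(void)
{
    static char buf[BATCH + RECORD];
    char record[RECORD];
    size_t used = 0;
    memset(record, 'x', sizeof(record));

    FILE *out = fopen("batched.tmp", "wb");
    if (!out)
        return 1;

    for (int i = 0; i < 10000; i++) {        /* ~10,000 iterations, as in the question */
        memcpy(buf + used, record, RECORD);
        used += RECORD;
        if (used >= BATCH) {                 /* one large write instead of thousands */
            fwrite(buf, 1, used, out);
            used = 0;
        }
    }
    if (used)
        fwrite(buf, 1, used, out);
    fclose(out);
    return 0;
}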
I know that Windows is very aggressive about caching disk data in RAM, and 100K would fit easily. The writes are going directly to cache and then perhaps being written to disk via a non-blocking write, which allows the program to continue. The RAM disk probably wouldn't support non-blocking operations because it expects those operations to be quick and not worth the bother.
By reducing the amount of memory available to programs and caching, you're going to increase the amount of disk I/O for paging even if only slightly.
This is all speculation on my part, since I'm not familiar with the kernel or drivers. I also speculate that Linux would operate similarly.
In my tests I've found that not only the batch size but also the nature of the data itself affects overall performance. I've managed to get write times 5 times better than the SSD in only one scenario: writing a 100 MB chunk of pre-generated random bytes to the RAM drive. Writing more "predictable" data, like the letters "aaa" or the current datetime, yields quite the opposite result - the SSD is always faster or equal. So my guess is that the operating system (Win 7 in my case) does a lot of caching and optimization.
It looks like the most hindering case for a RAM drive is lots of small writes instead of a few big ones, while the RAM drive shines at writing large amounts of hard-to-compress data.
I had the same mind-boggling experience, and after many tries I figured it out.
When the ramdisk is formatted as FAT32, then even though benchmarks show high values, real-world use is actually slower than an NTFS-formatted SSD.
But an NTFS-formatted ramdisk is faster in real life than the SSD.
I'll join the people having problems with RAM disk speeds (only on Windows).
The SSD I have can write 30 GiB (in one big block, dumping a 30 GiB RAM array) at a speed of 550 MiB/s (around 56 seconds to write the 30 GiB) - this is when the write is issued as a single statement in the source code.
The RAM disk (ImDisk) I have can write the same 30 GiB (in one big block, dumping a 30 GiB RAM array) at a speed of a bit less than 100 MiB/s (around 5 minutes and 13 seconds to write the 30 GiB) - again with the write issued as a single statement.
I have also done another RAM test: from source code, a sequential direct write (one byte per loop pass) to a 30 GiB RAM array (I have 64 GiB of RAM) runs at nearly 1.3 GiB/s (1298 MiB per second).
Why on earth is the RAM disk (on Windows) so slow for one big sequential write?
Of course that low write speed only happens with RAM disks on Windows: I tested the same concept on Linux with the native Linux RAM disk, and it can write at nearly one gigabyte per second.
Please note that I also tested SoftPerfect and other RAM disks on Windows; the RAM disk speeds are about the same and cannot write at more than one hundred megabytes per second.
Windows versions tested: 10 & 11 (both Home & Pro, 64-bit), with the RAM disk formatted as exFAT & NTFS; since the RAM disk speed was so slow, I tried to find a Windows version where the RAM disk speed was normal, but found none.
Linux kernel tested: only 5.15.11, since the native Linux RAM disk speed was normal and I did not need to test any other kernel.
Hope this helps other people, since knowledge is the basis for solving a problem.
