Imread & Imwrite do not achieve expected gains on a Ramdisk

Imread & Imwrite do not achieve expected gains on a Ramdisk - windows

I have written a particular image processing algorithm that makes heavy use of imwrite and imread. The following example will run simultaneously on eight Matlab sessions on a hyper-threading-enabled 6-core i7 machine. (Filenames are different for each session.)
tic;
for i=1:1000
%a processing operation will be put here%
imwrite(imgarray,temp,'Quality',100);
imgarray=imread(temp);
end
toc;
I'm considering temp=[ramdrive_loc temp]; change in the example code for two purposes:
Reducing time consumption
Lowering hard drive wearing
Image files created are about 1 Mb in size. Hard drives are formed as RAID0 with 2 x 7.2k Caviar Blacks. The machine is a Windows machine, in which partitions are formatted as NTFS.
The outputs of toc from above are (without processing images) :
Without Ramdisk: 104.330466 seconds.
With Ramdisk: 106.100880 seconds.
Is there anything that causes me not to gain any speed? Would changing file system of the ramdisk to FAT32 help?
Note: There were other questions regarding ramdisk vs. harddisk comparisons; however this question is mostly about imread, imwrite, and Matlab I/O.
Addition: The ram disk is set up through a free software from SoftPerfect. It has 3gb space, which is more than adequate for task (maximum of 10mb is to be generated and written over and over during Matlab sessions).

File caching. Probably, Windows' file cache is already speeding up your I/O activity here, so the RAM disk isn't giving you an additional speedup. When you write out the file, it's written to the file cache and then asynchronously flushed to the disk, so your Matlab code doesn't have to wait for the physical disk writes to complete. And when you immediately read the same file back in to memory, there's a high chance it's still present in the file cache, so it's served from memory instead of incurring a physical disk read.
If that's your actual code, you're re-writing the same file over and over again, which means all the activity may be happening inside the disk cache, so you're not hitting a bottleneck with the underlying storage mechanism.
Rewrite your test code so it looks more like your actual workload: writing to different files on each pass if that's what you'll be doing in practice, including the image processing code, and actually running multiple processes in parallel. Put it in the Matlab profiler, or add finer-grained tic/toc calls, to see how much time you're actually spending in I/O (e.g. imread and imwrite, and the parts of them that are doing file I/O). If you're doing nontrivial processing outside the I/O, you might not see significant, if any, speedup from the RAM disk because the file cache would have time to do the actual physical I/O during your other processing.
And since you say there's a maximum of 10 MB that gets written over and over again, that's small enough that it could easily fit inside the file cache in the first place, and your actual physical I/O throughput is pretty small: if you write a file, and then overwrite its contents with new data before the file cache flushes it to disk, the OS never has to flush that first set of data all the way to disk. Your I/O might already be mostly happening in memory due to the cache so switching to a RAM disk won't help because physical I/O isn't a bottleneck.
Modern operating systems do a lot of caching because they know scenarios like this happen. A RAM disk isn't necessarily going to be a big speedup. There's nothing specific to Matlab or imread/imwrite about this behavior; the other RAM disk questions like RAMdisk slower than disk? are still relevant.

Related

Constant Write Speed to Disk

I'm writing real-time data to an empty spinning disk sequentially. (EDIT: It doesn't have to be sequential, as long as I can read it back as if it was sequential.) The data arrives at a rate of 100 MB/s and the disks have an average write speed of 120 MB/s.
Sometimes (especially as free space starts to decrease) the disk speed goes under 100 MB/s depending on where on the platter the disk is writing, and I have to drop vital data.
Is there any way to write to disk in a pattern (or some other way) to ensure a constant write speed close to the average rate? Regardless of how much data there currently is on the disk.
EDIT:
Some notes on why I think this should be possible.
When usually writing to the disk, it starts in the fast portion of the platter and then writes towards the slower parts. However, if I could write half the data to the fast part and half the data to the slow part (i.e. for 1 second it could write 50MB to the fast part and 50MB to the slow part), they should meet in the middle. I could possibly achieve a constant rate?
As a programmer, I am not sure how I can decide where on the platter the data is written or even if the OS can achieve something similar.

If I had to do this on a regular Windows system, I would use a device with a higher average write speed to give me more headroom. Expecting 100MB/s average write speed over the entire disk that is rated for 120MB/s is going to cause you trouble. Spinning hard disks don't have a constant write speed over the whole disk.
The usual solution to this problem is to buffer in RAM to cover up infrequent slow downs. The more RAM you use as a buffer, the longer the span of slowness you can handle. These are tradeoffs you have to make. If your problem is the known slowdown on the inside sectors of a rotating disk, then your device just isn't fast enough.
Another thing that might help is to access the disk as directly as possible and ensure it isn't being shared by other parts of the system. Use a separate physical device, don't format it with a filesystem, write directly to the partitioned space. Yes, you'll have to deal with some of the issues a filesystem solves for you, but you also skip a bunch of code you can't control. Even then, your app could run into scheduling issues with Windows. Windows is not a RTOS, there are not guarantees as far as timing. Again this would help more with temporary slowdowns from filesystem cleanup, flushing dirty pages, etc. It probably won't help much with the "last 100GB writes at 80MB/s" problem.
If you really are stuck with a disk that goes from 120MB/s -> 80MB/s outside-to-inside (you should test with your own code and not trust the specs from the manufacture so you know what you're dealing with), then you're going to have to play partitioning games like others have suggested. On a mechanical disk, that will introduce some serious head seeking, which may eat up your improvement. To minimize seeks, it would be even more important to ensure it's a dedicated disk the OS isn't using for anything else. Also, use large buffers and write many megabytes at a time before seeking to the end of the disk. Instead of partitioning, you could write directly to the block device and control which blocks you write to. I don't know how to do this in Windows.
To solve this on Linux, I would be tempted to test mdadm's raid0 across two partitions on the same drive and see if that works. If so, then the work is done and you don't have to write and test some complicated write mechanism.

Partition the disk into two equally sized partitions. Write a few seconds worth of data alternating between the partitions. That way you get almost all of the usual sequential speed, nicely averaged. One disk seek every few seconds eats up almost no time. One seek per second reduces the usable time from 1000ms to ~990ms which is a ~1% reduction in throughput. The more RAM you can dedicate to buffering the less you have to seek.
Use more partitions to increase the averaging effect.

I fear this may be more difficult than you realize:
If your average 120 MB/s write speed is the manufacturer's value then it is most likely "optimistic" at best.
Even a benchmarked write speed is usually done on a non-partitioned/formatted drive and will be higher than what you'd typically see in actual use (how much higher is a good question).
A more important value is the drive's minimum write speed. For example, from Tom's Hardware 2013 HDD Benchmarks a drive with a 120 MB/s average has a 76 MB/s minimum.
A drive that is being used by other applications at the same time (e.g., Windows) will have a much lower write speed.
An even more important value is the drives actual measured performance. I would make a simple application similar to your use case that writes data to the drive as fast as possible until it fills the drive. Do this a few (dozen) times to get a more realistic average/minimum/maximum write speed value...it will likely be lower than you'd expect.
As you noted, even if your "real" average write speed is higher than 100 MB/s you run into issues if you run into slow write speeds just before the disk fills up, assuming you don't have somewhere else to write the data to. Using a buffer doesn't help in this case.
I'm not sure if you can actually specify a physical location to write to on the hard drive these days without getting into the drive's firmware. Even if you could this would be my last choice for a solution.
A few specific things I would look at to solve your problem:
Measure the "real" write performance of the drive to see if its fast enough. This gives you an idea of how far behind you actually are.
Put the OS on a separate drive to ensure the data drive is not being used by anything other than your application.
Get faster drives (either HDD or SDD). It is fine to use the manufacturer's write speeds as an initial guide but test them thoroughly as well.
Get more drives and put them into a RAID0 (or similar) configuration for faster write access. You'll again want to actually test this to confirm it works for you.

You could implement the strategy of alternating writes bewteen the inside and the outside by directly controlling the disk write locations. Under Windows you can open a disk like "\.\PhysicalDriveX" and control where it writes. For more info see
http://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx

First of all, I hope you are using raw disks and not a filesystem. If you're using a filesystem, you must:
Create an empty, non-sparse file that's as large as the filesystem will fit.
Obtain a mapping from the logical file positions to disk blocks.
Reverse this mapping, so that you can map from disk blocks to logical file positions. Of course some blocks are unavailable due to filesystem's own use.
At this point, the disk looks like a raw disk that you access by disk block. It's a valid assumption that this block addressing is mostly monotonous to the physical cylinder number. IOW if you increase the disk block number, the cylinder number will never decrease (or never increase -- depending on the drive's LBA to physical mapping order).
Also, note that a disk's average write speed may be given per cylinder or per unit of storage. How would you know? You need the latter number, and the only sure way to get it is to benchmark it yourself. You need to fill the entire disk with data, by repeatedly writing a zero page to the disk, going block by block, and divide the total amount of data written by the amount it took. You need to be accessing the disk or the file in the direct mode. This should disable the OS buffering for the file data, and not for the filesystem metadata (if not using a raw disk).
At this point, all you need to do is to write data blocks of sensible sizes at the two extremes of the block numbers: you need to fill the disk from both ends inwards. The size of the data blocks depends on the bandwidth wastage you can allow for seeks. You should also assume that the hard drive might seek once in a while to update its housekeeping data. Assuming a worst-case seek taking 15ms, you waste 1.5% of per-second bandwidth for each seek. Assuming you can spare no more than 5% of bandwidth, with 1 seek/s on average for the drive itself, you can seek twice per second. Thus your block size needs to be your_bandwith_per_second/2. This bandwidth is not the disk bandwidth, but the bandwidth of your data source.
Alas, if only things where this easy. It generally turns out that the bandwidth at the middle of the disk is not the average bandwidth. During your benchmark you must also take a note of write speed over smaller sections of the disk, say every 1% of the disk. This way, when writing into each section of the disk, you can figure out how to split the data between the "low" and the "high" section that you're writing to. Suppose that you're starting out at 0% and 99% positions on the disk, and the low position has a bandwidth of mean*1.5, and the high position has a bandwidth of mean*0.8, where mean is your desired mean bandwidth. You'll then need to write 100% * 1.5/(0.8+1.5) of the data into the low position, and the remainder (100% * 0.8/(0.8+1.5)) into the slower high position.
The size of your buffer needs to be larger than just the block size, since you must assume some worst-case latency for the hard drive if it hits bad blocks and needs to relocate data, etc. I'd say a 3 second buffer may be reasonable. Optionally it can grow by itself if latencies you measure while your software runs turn out higher. This buffer must be locked ("pinned") to physical memory so that it's not subject to swapping.

Another possible option is to destroke (or short stroke) a hard drive. If you start with a 4TB or larger drive and destroke it to 2TB, only the outer portions of the platters will be used, resulting in a faster throughput rate. The issue would be getting the software that issues vendor unique commands to a hard drive to destroke it.

Transferring 1-2 megabytes of data through regular files in Windows - is it slower than through RAM?

I'm passing 1-2 MB of data from one process to another, using a plain old file. Is it significantly slower than going through RAM entirely?
Before answering yes, please keep in mind that in modern Linux at least, when writing a file it is actually written to RAM, and then a daemon syncs the data to disk from time to time. So in that way, if process A writes a 1-2 MB into a file, then process B reads them within 1-2 seconds, process B would simply read the cached memory. It gets even better than that, because in Linux, there is a grace period of a few seconds before a new file is written to the hard disk, so if the file is deleted, it's not written at all to the hard disk. This makes passing data through files as fast as passing them through RAM.
Now that is Linux, is it so in Windows?
Edit: Just to lay out some assumptions:
The OS is reasonably new - Windows XP or newer for desktops, Windows Server 2003 or newer for servers.
The file is significantly smaller than available RAM - let's say less than 1% of available RAM.
The file is read and deleted a few seconds after it has been written.

When you read or write to a file Windows will often keep some or all of the file resident in memory (in the Standby List). So that if it is needed again, it is just a soft-page fault to map it into the processes' memory space.
The algorithm for what pages of a file will be kept around (and for how long) isn't publicly documented. So the short answer is that if you are lucky some or all of it may still be in memory. You can use the SysInternals tool VMmap to see what of your file is still in memory during testing.
If you want to increase your chances of the data remaining resident, then you should use Memory Mapped Files to pass the data between the two processes.
Good reading on Windows memory management:
Mysteries of Windows Memory Management Revealed

You can use FILE_ATTRIBUTE_TEMPORARY to hint that this data is never needed on disk:
A file that is being used for temporary storage. File systems avoid writing data back to mass storage if sufficient cache memory is available, because typically, an application deletes a temporary file after the handle is closed. In that scenario, the system can entirely avoid writing the data. Otherwise, the data is written after the handle is closed.
(i.e. you need use that flag with CreateFile, and DeleteFile immediately after closing that handle).
But even if the file remains cached, you still have to copy it twice: from your process A to the cache (the WriteFile call), and from cache to the proces B (ReadFile call).
Using memory mapped files (MMF, as josh poley already suggested) has the primary advantage of avoiding one copy: the same physical memory pages are mapped into both processes.
A MMF can be backed by virtual memory, which means basically that it always stays in memory unless swapping becomes necessary.
The major downside is that you can't easily grow the memory mapping to changing demands, you are stuck with the initial size.
Whether that matters for an 1-2 MB data transfer depends mostly on how you acquire and what you do with the data, in many scenarios the additional copy doesn't really matter.

Maximize memory for file cache or memory-based file system on Windows?

I have an application which needs to create many small files in maximum performance (less than 1% of them may be read later), and I want to avoid using asynchronous file API to keep the simplicity of my code. The size of total files written cannot be pre-determined, so I figured that to achieve maximum performance, I would need:
1.Windows to utilize all unused RAM for cache (especially for file writes), with no regard of relibility or other issues. If I have more than 1GB of unused RAM and I create one million of 1KB files, I expect Windows to report "DONE" immediately, even if it has written nothing to disk yet.
OR
2.A memory-based file system backed by real on-disk file system. Whenever I write files, it should first write everything in memory only, and then update on-disk file system in background. No delay in synchronous calls unless there isn't enough free memory. Note it's different from tmpfs or RAM disk implementations on Windows, since they require fixed amount of memory and utilize page file when there isn't enough RAM.
For option 1, I have tested VHD files - while it does offer as high as 50% increase of total performance under some configurations, it still does flushing-to-disk and cause writing to wait unnecessarily. And it appears there is no way I can disable journaling or further tweak the write caching behavior.....
For option 2, I have yet found anything similar..... Do such things exist?

RAMdisk slower than disk?

A python program I created is IO bounded. The majority of the time (over 90%) is spent in a single loop which repeats ~10,000 times. In this loop, ~100KB data is generated and written to a temporary file; it is then read back out by another program and statistics about that data collected. This is the only way to pass data into the second program.
Due to this being the main bottleneck, I thought that moving the location of the temporary file from my main HDD to a (~40MB) RAMdisk (inside of over 2GB of free RAM) would greatly increase the IO speed for this file and so reduce the run-time. However, I obtained the following results (each averaged over 20 runs):
Test data 1: Without RAMdisk - 72.7s, With RAMdisk - 78.6s
Test data 2: Without RAMdisk - 223.0s, With RAMdisk - 235.1s
It would appear that the RAMdisk is slower that my HDD.
What could be causing this?
Are there any other alternative to using a RAMdisk in order to get faster file IO?

Your operating system is almost certainly buffering/caching disk writes already. It's not surprising the RAM disk is so close in performance.
Without knowing exactly what you're writing or how, we can only offer general suggestions. Some ideas:
If you have 2 GB RAM you probably have a decent processor, so you could write this data to a filesystem that has compression. That would trade I/O operations for CPU time, assuming your data is amenable to that.
If you're doing many small writes, combine them to write larger pieces at once. (Can we see the source code?)
Are you removing the 100 KB file after use? If you don't need it, then delete it. Otherwise the OS may be forced to flush it to disk.

Can you write the data out in batches rather than one item at a time? Are you caching resources like open file handles etc or cleaning those up? Are your disk writes blocking, can you use background threads to saturate IO while not affecting compute performance.
I would look at optimising the disk writes first, and then look at faster disks when that is complete.

I know that Windows is very aggressive about caching disk data in RAM, and 100K would fit easily. The writes are going directly to cache and then perhaps being written to disk via a non-blocking write, which allows the program to continue. The RAM disk probably wouldn't support non-blocking operations because it expects those operations to be quick and not worth the bother.
By reducing the amount of memory available to programs and caching, you're going to increase the amount of disk I/O for paging even if only slightly.
This is all speculation on my part, since I'm not familiar with the kernel or drivers. I also speculate that Linux would operate similarly.

In my tests I've found that not only batch size affects overall performance, but also the nature of data itself. I've managed to get 5 times better write times compared to SSD in only one scenario: writing a 100MB chunk of pre-cooked random byte array to RAM drive. Writing more "predictable" data like letters "aaa" or current datetime yields quite opposite results - SSD is always faster or equal. So my guess is that opertating system (Win 7 in my case) does lots of caching and optimizations.
Looks like the most hindering case for RAM-drive is when you perform lots of small writes instead of a few big ones, and RAM drive shines at writing large amounts of hard-to-compress data.

I had the same mind boggling experience, and after many tries I figured it out.
When ramdisk is formatted as FAT32, then even though benchmarks shows high values, real world use is actually slower than NTFS formatted SSD.
But NTFS formatted ramdisk is faster in real life than SSD.

I join the people having problems with RAM disk speeds (only on Windows).
The SSD i have can write 30 GiB (in one big block, dump a 30GiB RAM ARRAY) with a speed of 550 MiB/s (arround 56 seconds to write 30 GiB) ... this is if the write is asked in one source code sentence.
The RAM Disk (imDisk) i have can write 30 GiB write (in one big block, dump a 30GiB RAM ARRAY) with a speed of a bit less than 100 MiB/s (arround 5 minutes and 13 seconds to write 30 GiB) ... this is if the write is asked in one source code sentence.
I had also done another RAM test: from source code do a sequential direct write (one byte per source code loop pass) to a 30GiB RAM ARRAY (i have 64GiB of RAM) and i get a speed of near 1.3GiB/s (1298 MiB per second).
Why on the hell (on Windows) RAM Disk is so slow for one BIG secuential write?
Of course that low write speed happens on RAM disks on Windows, since i tested the same 'concept' on Linux with Linux native ram disk and Linux ram disk can write at near one gigabyte per second.
Please note that i had also tested SoftPerfect and other RAM disks on Windows, RAM Disk speeds are near the same, can not write at more than one hundred megabytes per second.
Actual Windows tested: 10 & 11 (on both HOME & PRO, on 64 bits), RAM Disk format (exFAT & NTFS); since RAM disk speed was too slow i was trying to find one Windows version where RAM disk speed be normal, but found no one.
Actual Linux Kernel tested: Only 5.15.11, since Linux native RAM disk speed was normal i do not test on any other kernel.
Hope this help other people, since knowledge is the base to solve a problem.

osx: how do I find the size of the disk i/o cache (write cache, for example)

I am looking to optimize my disk IO, and am looking around to try to find out what the disk cache size is. system_profiler is not telling me, where else can I look?
edit: my program is processing entire volumes: I'm doing a secure-wipe, so I loop through all of the blocks on the volume, reading, randomizing the data, writing... if I read/write 4k blocks per IO operation the entire job is significantly faster than r/w a single block per operation. so my question stems from my search to find the ideal size of a r/w operation (ideal in terms of performance:speed). please do not point out that for a wipe-program I don't need the read operation, just assume that I do. thx.

Mac OS X uses a Unified Buffer Cache. What that means is that in the kernel VM objects and files are them at some level, same thing, and the size of the available memory for caching is entirely dependent on the VM pressure in the rest of the system. It also means the read and write caching is unified, if an item in the read cache is written to it just gets marked dirty and then will be written to disk when changes are committed.
So the disk cache may be very small or gigabytes large, and dynamically changes as the system is used. Because of this trying to determine the cache size and optimize based on it is a losing fight. You are much better off looking at doing things that inform the cache how to operate better, like checking with the underlying device's optimal IO size is, or identifying data that should not be cached and using F_NOCACHE.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio