Load a file into memory by blocks of 256 KB - algorithm

I'm studying blocked sort-based indexing, and the algorithm talks about loading files in blocks of 32 or 64 KB, because disk reads happen in blocks, so this is efficient.
My first question is: how am I supposed to load a file by block? With a buffered reader of 64 KB? And if I use a Java input stream, has this optimization already been done so that I can simply consume the stream?
I actually use Apache Spark, so does sparkContext.textFile() do this optimization? What about Spark Streaming?

I don't think that on the JVM you have any direct view of the file system that would make it meaningful to align reads with block sizes. Also, there are various kinds of drives and many different file systems now, and block sizes would most likely vary or even have little effect on the total I/O time.
The best performance would probably be to use java.nio.FileChannel, and then you can experiment with reading ByteBuffers of given block sizes to see if it makes any performance difference. I would guess the only effect you see is that the JVM overhead for very small buffers matters more (extreme case, reading byte by byte).
You may also use the file-channel's map method to get hold of a MappedByteBuffer.

Related

How to speed up libjpeg decompression

We are using libjpeg for JPEG decoding on our small embedded platform. We have problems with speed when we decode large images. For example, an image that is 20 MB in size with dimensions of 5000x3000 pixels takes 10 seconds to load.
I need some tips on how to improve decoding speed. On another platform with similar performance, the same image loads in two seconds.
The best improvement, from 14 seconds down to 10 seconds, came from using a larger read buffer (64 kB instead of the default 4 kB). Nothing else helped.
We do not need to display the image at full resolution, so we use scale_num and scale_denom to display it at a smaller size. But I would like more performance. Is it possible to use some kind of multithreading? Different decoding settings? Anything, I'm out of ideas.
First - profile the code. You're left with little more than speculation if you cannot definitively identify the bottlenecks.
Next, scour the documentation for libjpeg speedup opportunities. You mentioned scale_num and scale_denom. What about the decompressor's dct_method? I've found that the JDCT_FASTEST option is good. There are other options to check: do_fancy_upsampling, do_block_smoothing, dither_mode, two_pass_quantize, etc. Some or all of these may be useful to you, depending on your system, libjpeg version, etc.
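For illustration, here is a minimal sketch of where those parameters are set in a libjpeg decode sequence. The chosen values are only one possible speed/quality trade-off, error handling is pared down, and the decoded rows are simply discarded:

```c
/* Minimal sketch: speed-oriented libjpeg decompression settings.
 * Set the fields after jpeg_read_header() and before jpeg_start_decompress(). */
#include <stdio.h>
#include <jpeglib.h>

void decode_fast(FILE *in)
{
    struct jpeg_decompress_struct cinfo;
    struct jpeg_error_mgr jerr;

    cinfo.err = jpeg_std_error(&jerr);
    jpeg_create_decompress(&cinfo);
    jpeg_stdio_src(&cinfo, in);
    jpeg_read_header(&cinfo, TRUE);

    /* Trade quality for speed. */
    cinfo.dct_method = JDCT_FASTEST;
    cinfo.do_fancy_upsampling = FALSE;
    cinfo.do_block_smoothing = FALSE;
    cinfo.two_pass_quantize = FALSE;     /* only relevant when quantizing colors */
    cinfo.dither_mode = JDITHER_NONE;

    /* Decode at reduced size, as with scale_num/scale_denom above. */
    cinfo.scale_num = 1;
    cinfo.scale_denom = 4;

    jpeg_start_decompress(&cinfo);

    int row_stride = cinfo.output_width * cinfo.output_components;
    JSAMPARRAY row = (*cinfo.mem->alloc_sarray)
        ((j_common_ptr)&cinfo, JPOOL_IMAGE, row_stride, 1);
    while (cinfo.output_scanline < cinfo.output_height)
        jpeg_read_scanlines(&cinfo, row, 1); /* consume (or store) each row */

    jpeg_finish_decompress(&cinfo);
    jpeg_destroy_decompress(&cinfo);
}
```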
If profiling tools are unavailable, there are still some things to try. First, I suspect your bottlenecks are non-CPU related. To confirm, load the compressed image into a RAM buffer, then decompress it from there as you have been. Did that significantly improve the decompression time? If so, the culprit would appear to be the read operation from your image storage media. Depending on your system, reading from USB (or SD, etc.) can be slow. (Note that I'm assuming a read from external media, although hardware details are scant.) Be sure to optimize relevant bus parameters as well (SPI clocks, configurations, etc.).
If you are reading from something like internal flash (i.e. NAND), there are some other things to inspect. How is your NAND controller configured? Have you ensured that the controller is configured for the fastest operation? Check wait states, timings, etc. Note that bus and/or memory contention can be an issue, too - so inspect their respective configurations, as well.
Finally, if you believe your system is actually CPU bound, this stackoverflow question may be of interest:
Can a high-performance jpeglib-turbo implementation decompress/compress in <100ms?
Multi-threading can only help the decode process if the target has multiple execution units for true concurrent execution. Otherwise it will just time-slice existing CPU resources. It won't help in any case unless the library was designed to make use of it.
If you built the library from source, you should first ensure you built it with optimisation switched on, and carefully select the compiler options to match the build to your target and its instruction set, to enable the compiler to use SIMD or an FPU, for example.
You also need to consider other possible bottlenecks. Is the 10 seconds just the time to decode, or does it include the time to read from a filesystem or network, for example? Given the improvement observed when you increased the read buffer size, it seems highly likely that it is the data read rather than the decode that is limiting in this case.
If in fact the filesystem access is the limiting factor rather than the decode, then there may be some benefit in moving the file read into a separate thread and passing the data to the decoder via a pipe, a queue, or multiple shared-memory buffers. You can then ensure that the decoder can stream the decode without having to block on the filesystem.
Take a look at libjpeg-turbo. If you have supported hardware then it is generally 2-4 times faster than libjpeg on the same CPU. A typical 12 MB JPEG is decoded in less than 2 seconds on a Pandaboard. You can also take a look at a speed analysis of various JPEG decoders here.

How is fseek() implemented in the filesystem?

This is not a pure programming question, however it impacts the performance of programs using fseek(), hence it is important to know how it works. A little disclaimer so that it doesn't get closed.
I am wondering how efficient it is to insert data in the middle of a file. Suppose I have a file with 1 MB of data and I insert something at the 512 KB offset. How efficient would that be compared to appending my data at the end of the file? Just to make the example complete, let's say I want to insert 16 KB of data.
I understand the answer varies depending on the filesystem, however I assume that the techniques used in common filesystems are quite similar and I just want to get the right notion of it.
(disclaimer: I want just to add some hints to this interesting discussion)
IMHO there are some things to take into account:
1) fseek is not a primary system service, but a library function. To evaluate its performance we must consider how the file stream library is implemented. In general, the file I/O library adds a layer of buffering in user space, so the performance of fseek may be quite different depending on whether the target position is inside or outside the current buffer. Also, the system services that the I/O library uses may vary a lot; for example, on some systems the library makes extensive use of file memory mapping where possible.
2) As you said, different filesystems may behave in very different ways. In particular, I would expect that a transactional filesystem must do something very smart, and perhaps expensive, to be prepared for a possible rollback of an aborted write operation in the middle of a file.
3) Modern OSes have very aggressive caching algorithms. An "fseeked" file is likely to already be present in cache, so operations become much faster. But they may degrade a lot if the overall filesystem activity produced by other processes becomes significant.
Any comments?
fseek(...) is a library call, not an OS system call. It is the run-time library that takes care of the actual overhead involved in making a system call to the OS; technically speaking, fseek indirectly makes a call to the system, but on its own it is not a system call (which highlights the distinction between a library call and a system call). fseek(...) is a standard input/output function regardless of the underlying system... however... and this is a big however...
The OS will more than likely have cached the file in its kernel memory, that is, the direct offsets to the locations on disk where the 1's and 0's are stored. It is through the OS's kernel layers, more than likely a top-most layer within the kernel, that a snapshot of what the file is composed of is held, i.e. its data irrespective of what it contains (the kernel does not care either way, as long as the 'pointers' to the disk structure for that offset to the location on disk are valid!)...
When fseek(...) occurs, there can be a lot of overhead. Indirectly, the kernel is delegated the task of reading from the disk; depending on how fragmented the file is, the data could in theory be "all over the place", and gathering it into "one contiguous view of the data" from the user-land perspective (the C code doing the fseek(...)) can be a significant cost. Hence, inserting into the middle of a file (remember, at this stage the kernel would have to adjust the locations/offsets into the actual disk platter for the data) would be deemed slower than appending to the end of the file.
The reason is quite simple: the kernel "knows" what the last offset was, and can simply wipe the EOF marker and insert more data. Behind the scenes, the kernel has to allocate another block of memory for the disk buffer, with the adjusted offset to the location on disk following the EOF marker, once the appending of data is complete.
Let us assume the ext2 FS and the Linux OS as an example. I don't think there will be a significant performance difference between an insert and an append. In both cases the file's inode and offset table must be read, the relevant disk sectors mapped into memory, the data updated and, at some later point, the data written back to disk. What will make a big performance difference in this example is good temporal and spatial locality when accessing parts of the file, since this will reduce the number of load/store combos.
As a previous answer says, you may be able to speed up both operations if you deal with data writes that are exact multiples of the FS block size; in this case you could skip the load stage and just insert the new blocks into the file's inode data structure. This would not be practical, as you would need low-level access to the FS driver, and using it would be very restrictive and not portable.
One observation I have made about fseek on Solaris is that each call to it resets the read buffer of the FILE. The next read will then always read a full block (8K by default). So if you have a lot of random access with small reads, it's a good idea to do it unbuffered (setvbuf with a NULL buffer) or even use direct syscalls (lseek+read, or even better pread, which is only 1 syscall instead of 2). I suppose this behaviour will be similar on other OSes.
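As a small illustration of both suggestions, here is a sketch in C; the file name, offset and read size are placeholders:

```c
/* Sketch: (1) stdio with the FILE buffer disabled, so a small read does not
 * drag in a full 8K block, and (2) pread, which does the seek and the read
 * in a single syscall. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char buf[64];

    /* Option 1: unbuffered stdio. */
    FILE *fp = fopen("data.bin", "rb");
    if (!fp) return 1;
    setvbuf(fp, NULL, _IONBF, 0);          /* NULL buffer, unbuffered mode */
    fseek(fp, 123456L, SEEK_SET);
    fread(buf, 1, sizeof buf, fp);
    fclose(fp);

    /* Option 2: direct syscall; one pread instead of lseek + read. */
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) return 1;
    pread(fd, buf, sizeof buf, 123456);    /* 64 bytes at offset 123456 */
    close(fd);
    return 0;
}
```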
You can insert data into the middle of a file efficiently only if the data size is a multiple of the FS sector size, but OSes don't provide such functions, so you have to use a low-level interface to the FS driver.
Inserting data in the middle of the file is less efficient than appending to the end because when inserting you would have to move the data after the insertion point to make room for the data being inserted. Moving these data would involve reading them from disk, writing the data to be inserted and then writing the old data after the inserted data. So you have at least one extra read and write when inserting.
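To make the extra work concrete, here is a rough sketch of a mid-file insert done through the ordinary file API. It assumes the tail after the insertion point fits in memory and that the stream was opened in update mode ("r+b"):

```c
/* Rough sketch showing where the extra read and write come from when
 * inserting into the middle of a file. */
#include <stdio.h>
#include <stdlib.h>

int insert_at(FILE *fp, long offset, const void *data, size_t len)
{
    /* Extra read: everything that sits after the insertion point. */
    fseek(fp, 0, SEEK_END);
    long end = ftell(fp);
    long tail_len = end - offset;
    unsigned char *tail = malloc(tail_len);
    if (!tail) return -1;
    fseek(fp, offset, SEEK_SET);
    fread(tail, 1, tail_len, fp);

    /* Write the new data, then the extra write: put the old tail back after it. */
    fseek(fp, offset, SEEK_SET);
    fwrite(data, 1, len, fp);
    fwrite(tail, 1, tail_len, fp);

    free(tail);
    return 0;
}
```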

What are the most efficient idioms for streaming data from disk with constant space usage?

Problem Description
I need to stream large files from disk. Assume the files are larger than will fit in memory. Furthermore, suppose that I'm doing some calculation on the data and the result is small enough to fit in memory. As a hypothetical example, suppose I need to calculate an md5sum of a 200GB file and I need to do so with guarantees about how much ram will be used.
In summary:
Needs to be constant space
Fast as possible
Assume very large files
Result fits in memory
Question
What are the fastest ways to read/stream data from a file using constant space?
Ideas I've had
If the file were small enough to fit in memory, then mmap on POSIX systems would be very fast; unfortunately that's not the case here. Is there any performance advantage to using mmap with a small buffer size to buffer successive chunks of the file? Would the system-call overhead of moving the mmap buffer down the file dominate any advantages? Or should I use a fixed buffer that I read into with fread?
I wouldn't be so sure that mmap would be very fast (where very fast is defined as significantly faster than fread).
Grep used to use mmap, but switched back to fread. One of the reasons was stability (strange things happen with mmap if the file shrinks whilst it is mapped or an IO error occurs). This page discusses some of the history about that.
You can compare the performance on your system with the option --mmap to grep. On my system the difference in performance on a 200GB file is negligible, but your mileage might vary!
In short, I'd use fread with a fixed size buffer. It's simpler to code, easier to handle errors and will almost certainly be fast enough.
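For what it's worth, a minimal sketch of that approach in C; a trivial running checksum stands in for the md5 computation, and the 64 KB buffer size is just a starting point to tune:

```c
/* Constant-space streaming with fread and a fixed buffer. Memory use stays
 * at BUF_SIZE no matter how large the file is. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define BUF_SIZE (64 * 1024)    /* tune between ~4 KB and ~128 KB */

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    FILE *fp = fopen(argv[1], "rb");
    if (!fp) return 1;

    unsigned char *buf = malloc(BUF_SIZE);
    uint64_t sum = 0;
    size_t n;

    while ((n = fread(buf, 1, BUF_SIZE, fp)) > 0) {
        for (size_t i = 0; i < n; i++)
            sum += buf[i];      /* replace with the real digest update over buf[0..n) */
    }

    printf("checksum: %llu\n", (unsigned long long)sum);
    free(buf);
    fclose(fp);
    return 0;
}
```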
Depending on the language you are using, a C-like fread() loop with a buffer of a declared size will require exactly that much memory, no more, no less.
We typically choose a buffer size of 4 to 128 kBytes; there is little gain, if any, with bigger buffers.
If performance were extremely important, one could consider, for a relatively small gain (and at the risk of re-inventing something), a two-thread implementation, whereby one thread reads the file into a pair of buffers, and the other thread performs calculations on one buffer at a time, in sequential fashion. In this way the disk-access delays can be hidden.
mjv is right. You can use double-buffers and overlapped I/O. That way your crunching and the disk reading can be happening at the same time. Then I would profile or stack-shot the crunching to make it as fast as possible. With luck it will be faster than the I/O, so you will end up running the I/O at top speed without pause. Then things like file fragmentation come into the picture.
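A bare-bones sketch of that double-buffer idea with POSIX threads follows; the synchronisation is deliberately simple (strict ping-pong between two buffers), and the byte-summing loop stands in for the real crunching:

```c
/* Double buffering: a reader thread fills two alternating buffers while the
 * main thread crunches the one that is already full. */
#include <pthread.h>
#include <stdio.h>

#define BUF_SIZE (64 * 1024)

static unsigned char bufs[2][BUF_SIZE];
static size_t lens[2];
static int ready[2];            /* 1 = filled by the reader, waiting to be crunched */
static int done = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

static void *reader(void *arg)
{
    FILE *fp = arg;
    int i = 0;
    for (;;) {
        pthread_mutex_lock(&m);
        while (ready[i]) pthread_cond_wait(&cv, &m);   /* wait until buffer i is free */
        pthread_mutex_unlock(&m);

        size_t n = fread(bufs[i], 1, BUF_SIZE, fp);

        pthread_mutex_lock(&m);
        lens[i] = n;
        if (n == 0) done = 1; else ready[i] = 1;
        pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&m);

        if (n == 0) break;
        i ^= 1;
    }
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    FILE *fp = fopen(argv[1], "rb");
    if (!fp) return 1;

    pthread_t t;
    pthread_create(&t, NULL, reader, fp);

    unsigned long long sum = 0; /* stands in for the real crunching */
    int i = 0;
    for (;;) {
        pthread_mutex_lock(&m);
        while (!ready[i] && !done) pthread_cond_wait(&cv, &m);
        if (!ready[i] && done) { pthread_mutex_unlock(&m); break; }
        pthread_mutex_unlock(&m);

        for (size_t k = 0; k < lens[i]; k++) sum += bufs[i][k];

        pthread_mutex_lock(&m);
        ready[i] = 0;           /* hand the buffer back to the reader */
        pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&m);
        i ^= 1;
    }

    pthread_join(t, NULL);
    printf("checksum: %llu\n", sum);
    fclose(fp);
    return 0;
}
```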

Optimizing locations of on-disk data for sequential access

I need to store large amounts of data on-disk in approximately 1k blocks. I will be accessing these objects in a way that is hard to predict, but where patterns probably exist.
Is there an algorithm or heuristic I can use that will rearrange the objects on disk based on my access patterns to try to maximize sequential access, and thus minimize disk seek time?
On modern OSes (Windows, Linux, etc) there is absolutely nothing you can do to optimise seek times! Here's why:
You are in a pre-emptive multitasking system. Your application and all its data can be flushed to disk at any time - the user switches task, the screen saver kicks in, the battery runs out of charge, etc.
You cannot guarantee that the file is contiguous on disk. Doing Aaron's first bullet point will not ensure an unfragmented file. When you start writing the file, the OS doesn't know how big the file is going to be so it could put it in a small space, fragmenting it as you write more data to it.
Memory mapping the file only works as long as the file size is less than the available address range in your application. On Win32, the amount of address space available is about 2Gb - memory used by application. Mapping larger files usually involves un-mapping and re-mapping portions of the file, which won't be the best of things to do.
Putting data in the centre of the file is no help as, for all you know, the central portion of the file could be the most fragmented bit.
To paraphrase Raymond Chen, if you have to ask about OS limits, you're probably doing something wrong. Treat your filesystem as an immutable black box, it just is what it is (I know, you can use RAID and so on to help).
The first step you must take (and must be taken whenever you're optimising) is to measure what you've currently got. Never assume anything. Verify everything with hard data.
From your post, it sounds like you haven't actually written any code yet, or, if you have, there is no performance problem at the moment.
The only real solution is to look at the bigger picture and develop methods to get data off the disk without stalling the application. This would usually be through asynchronous access and speculative loading. If your application is always accessing the disk and doing work with small subsets of the data, you may want to consider reorganising the data to put all the useful stuff in one place and the other data elsewhere. Without knowing the full problem domain, it's not possible to be really helpful.
Depending on what you mean by "hard to predict", I can think of a few options:
If you always seek based on the same block field/property, store the records on disk sorted by that field. This lets you use binary search for O(log n) efficiency (a sketch follows these options).
If you seek on different block fields, consider storing an external index for each field. A b-tree gives you O(log n) efficiency. When you seek, grab the appropriate index, search it for your block's data file address and jump to it.
Better yet, if your blocks are homogeneous, consider breaking them down into database records. A database gives you optimized storage, indexing, and the ability to perform advanced queries for free.
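As a sketch of the first option (records kept sorted on disk by the seek field), assuming fixed-size ~1k records whose layout and key type are made up for illustration:

```c
/* Fixed-size records sorted on disk by a key field, located with a binary
 * search implemented over fseek/fread: O(log n) seeks instead of a scan. */
#include <stdio.h>
#include <stdint.h>

#define REC_SIZE 1024                     /* ~1k blocks, as in the question */

struct record {
    uint64_t key;                         /* the field the file is sorted by */
    char payload[REC_SIZE - sizeof(uint64_t)];
};

/* Returns 1 and fills *out if a record with the given key exists. */
int find_record(FILE *fp, uint64_t key, struct record *out)
{
    fseek(fp, 0, SEEK_END);
    long nrec = ftell(fp) / (long)sizeof(struct record);
    long lo = 0, hi = nrec - 1;

    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        fseek(fp, mid * (long)sizeof(struct record), SEEK_SET);
        if (fread(out, sizeof *out, 1, fp) != 1) return 0;
        if (out->key == key) return 1;
        if (out->key < key) lo = mid + 1; else hi = mid - 1;
    }
    return 0;
}
```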
Use memory-mapped file access rather than the usual open-seek-read/write pattern. This technique works on Windows and Unix platforms.
In this way the operating system's virtual memory system will handle the caching for you. Accesses of blocks that are already in memory will result in no disk seek or read time. Writes from memory back to disk are handled automatically and efficiently and without blocking your application.
Aaron's notes are good too as they will affect initial-load time for a chunk that's not in memory. Combine that with the memory-mapped technique -- after all it's easier to reorder chunks using memcpy() than by reading/writing from disk and attempting swapouts etc.
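A minimal POSIX sketch of the memory-mapped access described above; on Windows the equivalent calls are CreateFileMapping/MapViewOfFile, and the file name and block number here are placeholders:

```c
/* Memory-mapped access: the OS page cache handles caching and write-back;
 * error handling is reduced to early returns. */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("blocks.dat", O_RDWR);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) return 1;

    /* Map the whole file; no explicit seek or read calls afterwards. */
    unsigned char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    /* Access a ~1k block by plain pointer arithmetic. */
    size_t block = 42;
    p[block * 1024] ^= 0xFF;    /* touch one byte in block 42 */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```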
The simplest way to solve this is to use an OS which solves it for you under the hood, like Linux. Give it enough RAM to hold 10% of the objects and it will try to keep as many of them in the cache as possible, reducing the load time to 0. Recent server versions of Windows might work too (some of them didn't for me, which is why I mention this).
If this is a no go, try this algorithm:
Create a very big file on the harddisk. It is very important that you write this in one go so the OS will allocate a continuous space on disk.
Write all your objects into that file. Make sure that each object is the same size (or give each the same space in the file and note the length in the first few bytes of each chunk). Use an empty hard disk or a disk which has just been defragmented.
In a data structure, keep the offsets of each data chunk and how often it is accessed. When it is accessed very often, swap its position in the file with a chunk that is closer to the start of the file and which has a lesser access count.
[EDIT] Access this file with the memory-mapped API of your OS to allow the OS to effectively cache the most used parts to get best performance until you can optimize the file layout next time.
Over time, heavily accessed chunks will bubble to the top. Note that you can collect the access patterns over some time, analyze them and do the reorder over night when there is little load on your machine. Or you can do the reorder on a completely different machine and swap the file (and the offset table) when that's done.
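A compact sketch of the bookkeeping these steps describe, with made-up sizes; fp is assumed to be opened in update mode ("r+b"), and the linear scan is only for illustration since, as noted above, the reorder is better done offline in a batch:

```c
/* Per-chunk offset/access-count table; a frequently accessed chunk is swapped
 * toward the front of the file. */
#include <stdio.h>

#define CHUNK_SIZE 1024
#define NUM_CHUNKS 1000

struct chunk_info {
    long offset;                 /* where this logical chunk lives on disk */
    unsigned long access_count;
};

static struct chunk_info table[NUM_CHUNKS];

/* Swap the on-disk positions of two logical chunks and update the table. */
static void swap_chunks(FILE *fp, int a, int b)
{
    unsigned char ba[CHUNK_SIZE], bb[CHUNK_SIZE];

    fseek(fp, table[a].offset, SEEK_SET); fread(ba, 1, CHUNK_SIZE, fp);
    fseek(fp, table[b].offset, SEEK_SET); fread(bb, 1, CHUNK_SIZE, fp);
    fseek(fp, table[a].offset, SEEK_SET); fwrite(bb, 1, CHUNK_SIZE, fp);
    fseek(fp, table[b].offset, SEEK_SET); fwrite(ba, 1, CHUNK_SIZE, fp);

    long tmp = table[a].offset;
    table[a].offset = table[b].offset;
    table[b].offset = tmp;
}

/* Record an access; if this chunk is now hotter than one stored before it,
 * bubble it toward the start of the file. */
void access_chunk(FILE *fp, int idx)
{
    table[idx].access_count++;
    for (int j = 0; j < NUM_CHUNKS; j++) {
        if (table[j].offset < table[idx].offset &&
            table[j].access_count < table[idx].access_count) {
            swap_chunks(fp, idx, j);
            break;
        }
    }
}
```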
That said, you should really rely on a modern OS where a lot of clever people have thought long and hard to solve these issues for you.
That's an interesting challenge. Unfortunately, I don't know how to solve this out of the box, either. Corbin's approach sounds reasonable to me.
Here's a little optimization suggestion, at least: place the most-accessed items at the center of your disk (or unfragmented file), not at the start or end. That way, seeks to lesser-used data will be shorter on average. Err, that's pretty obvious, though.
Please let us know if you figure out a solution yourself.

How do you write the end of a file opened with FILE_FLAG_NO_BUFFERING?

I am using VB6 and the Win32 API to write data to a file. This functionality is for the export of data, so write performance to the disk is the key factor in my considerations. As such I am using the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH options when opening the file with a call to CreateFile.
FILE_FLAG_NO_BUFFERING requires that I use my own buffer and write data to the file in multiples of the disk's sector size. This is generally no problem, apart from the last part of the data, which, if it is not an exact multiple of the sector size, will include zero characters padding out the file. How do I set the file size once the last block is written so that it does not include these zero characters?
I can use SetEndOfFile; however, this requires me to close the file and re-open it without using FILE_FLAG_NO_BUFFERING. I have seen someone talk about NtSetInformationFile, but I cannot find how to use and declare this in VB6. SetFileInformationByHandle can do exactly what I want, but it is only available in Windows Vista, and my application needs to be compatible with previous versions of Windows.
I believe SetEndOfFile is the only way.
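For reference, a minimal Win32 sketch in C of that approach (the VB6 declarations would mirror these calls); the path and the real data length are assumed to be tracked by the caller:

```c
/* Reopen the file without FILE_FLAG_NO_BUFFERING and truncate it to the real
 * data length, cutting off the zero padding written in the last sector. */
#include <windows.h>

BOOL trim_padding(const char *path, LONGLONG real_size)
{
    HANDLE h = CreateFileA(path, GENERIC_WRITE, 0, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) return FALSE;

    LARGE_INTEGER pos;
    pos.QuadPart = real_size;
    BOOL ok = SetFilePointerEx(h, pos, NULL, FILE_BEGIN) &&
              SetEndOfFile(h);           /* end of file now at real_size */

    CloseHandle(h);
    return ok;
}
```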
And I agree with Mike G. that you should bench your code with and without FILE_FLAG_NO_BUFFERING. Windows file buffering on modern OS's is pretty darn effective.
I'm not sure, but are YOU sure that setting FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH gives you maximum performance?
They'll certainly result in your data hitting the disk as soon as possible, but that sort of thing doesn't actually help performance - it just helps reliability for things like journal files that you want to be as complete as possible in the event of a crash.
For a data export routine like you describe, allowing the operating system to buffer your data will probably result in BETTER performance, since the writes will be scheduled in line with other disk activity, rather than forcing the disk to jump back to your file every write.
Why don't you benchmark your code without those options? Leave in the zero-byte padding logic to make it a fair test.
If it turns out that skipping those options is faster, then you can remove the 0-padding logic, and your file size issue fixes itself.
For a one-gigabyte file, Windows buffering will indeed probably be faster, especially if doing many small I/Os. If you're dealing with files which are much larger than available RAM, and doing large-block I/O, the flags you were setting WILL produce much better throughput (up to three times faster for heavily threaded and/or random large-block I/O).

Resources