Reading in a very large file - cocoa

I'm working on an application that reads in a huge text file (it can be up to 5 GB in size). Currently, I am using fscanf to read in the file, because I have found it to be the fastest so far. However, it still takes a long time to read the whole file in.
Is there a faster way to read in data from a file?

First, you should strongly avoid reading a 5GB file into memory as a single step. The memory impact alone should keep you away from this approach. Instead, you should try to take another approach such as:
Process the data as you read it and throw away the data
Convert the file to a Core Data model prior to work
Convert the file to a fixed-length record format so you can do random-access
Modify the file format so that it is less redundant
Index the file so you can do random-access
Partition the data into separate files
Memory map the file using NSFileWrapper (far from a panacea, but can be useful in conjunction with the above; NSFileWrapper automatically does memory mapping)
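As an illustration of the "index the file" option, here is a minimal sketch in Python (chosen only for brevity; the same idea applies in C or Objective-C). The helper names build_line_index and read_line are hypothetical, and the sketch assumes a line-oriented file:

```python
import os
import tempfile

def build_line_index(path):
    """Scan the file once and record the byte offset where each line starts."""
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def read_line(path, offsets, n):
    """Jump straight to line n using the index instead of rereading the file."""
    with open(path, "rb") as f:
        f.seek(offsets[n])
        return f.readline()

# Small demonstration file standing in for the large one.
with tempfile.NamedTemporaryFile("wb", suffix=".dat", delete=False) as tmp:
    tmp.write(b"alpha\nbeta\ngamma\n")
    demo_path = tmp.name

index = build_line_index(demo_path)
third_line = read_line(demo_path, index, 2)
os.remove(demo_path)
```

The index costs one full pass to build, so it only pays off if it is saved and reused across runs.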
You should start by getting a performance baseline:
time cat thebigfile.dat > /dev/null
It is hard to imagine reading the file much faster than that, so that's your floor.
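If you want the same kind of baseline from inside a program, a rough sketch is to read the file sequentially in large chunks and discard the data. This is Python for brevity, and measure_read_throughput is a hypothetical helper name; note that a second run is usually faster because the OS has cached the file.

```python
import os
import tempfile
import time

def measure_read_throughput(path, chunk_size=1 << 20):
    """Read the file sequentially, discard the data, return (bytes, seconds)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:          # empty read means end of file
                break
            total += len(chunk)
    return total, time.perf_counter() - start

# Small demonstration file standing in for the large one.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"x" * 3_000_000)
    demo_path = tmp.name

total_bytes, elapsed = measure_read_throughput(demo_path)
os.remove(demo_path)
```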
You should definitely do some performance analysis in Instruments and make sure the problem is the reading and not the processing. In particular, memory allocation can be more expensive than you may expect, particularly in a multi-threaded app.
Once you've investigated the above, and you still need really fast management of on-disk data, look at dispatch_io and dispatch_data. This is a really awesome tool for high-speed data management. But it is almost always better to improve your basic algorithms first before worrying about this kind of optimization.

Related

Is it efficient to read, process and write one line at a time?

I am working on a project that requires reading a file, manipulating each line, and generating a new file. I am a bit concerned about performance. Which algorithm is more efficient? I wrote some pseudocode below.
Store everything to an array, close the file, manipulate each line and store new array to output file:
openInputFile()
lineArray[] = readInput()
closeInputFile()
for (i in lineArray) // i: current line
    manipulate i
    newArray[] += i // store manipulated line in the new array
openOutputFile()
writeOutput(newArray)
closeOutput()
Get each line in a loop and, after manipulation, write the new line to the output:
openInputFile()
openOutputFile()
for (i in inputFile) // i: current line
    manipulate i
    print manipulated line to output
closeInputFile()
closeOutputFile()
Which one should I choose?

It depends on how large the input file is:
If it is small, it doesn't matter which approach you use.
If it is large enough, then the overhead of holding the entire input file and the entire output file in memory at the same time can have significant performance impacts. (Increased paging load, etcetera.)
If it is really large, you will run out of memory and the application will fail.
If you cannot predict the number of lines there will be, then preallocating the line array is problematic.
Provided that you use buffered input and output streams, the second version will be more efficient, will use less memory, and won't break if the input file is too big.
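A minimal sketch of the second version, in Python since the question does not name a language. The function name transform_file and the manipulate callback are hypothetical; Python's open() returns buffered streams by default, which is the assumption the answer relies on.

```python
import os
import tempfile

def transform_file(src_path, dst_path, manipulate):
    """Read one line at a time, transform it, and write it out immediately."""
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:   # only the current line (plus I/O buffers) is in memory
            dst.write(manipulate(line))

# Demonstration with a tiny input file.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "input.txt")
dst_path = os.path.join(workdir, "output.txt")
with open(src_path, "w", encoding="utf-8") as f:
    f.write("one\ntwo\nthree\n")

transform_file(src_path, dst_path, str.upper)

with open(dst_path, encoding="utf-8") as f:
    result = f.read()
```

Memory use stays roughly constant regardless of how many lines the input has.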

In both cases you read from each file once, and write to each file once. From that perspective, there isn't much difference in efficiency. Filesystems are good at buffering and serialising IO, and your disks are almost always the limiting factor in this sort of thing.
In an extreme case, you do sometimes gain a bit of efficiency with batching your write operations - a single large write is more efficient than lots of small ones. This is very rarely relevant on a modern operating system though, as they'll already be doing that behind the scenes.
So the key difference between the two approaches is memory use - in the former case, you have a much larger memory footprint, and gain no advantage from doing it. You should therefore go for the second choice*.
* Unless you actually need to reference elsewhere in the array, e.g. if you need to sort your data, because you then do need to pull your whole file into memory to manipulate it.

Clojure Time Series Analysis

I have a large data set (200 GB uncompressed, 9 GB compressed with bz2 -9) of stock tick data.
I want to run some basic time series analysis on them.
My machine has 16GB of RAM.
I would prefer to:
keep all data, compressed, in memory
decompress that data on the fly, and stream it [so nothing ever hits disk]
do all analysis in memory
Now, I think there are nice interactions here with Clojure's laziness and future objects (i.e. I can define objects such that when I try to access them, I'll decompress them on the fly).
Question: what are the things I should keep in mind when doing high performance time series analysis in Clojure?
I'm particularly interested in tricks involving:
efficiently storing tick data in memory
efficiently doing computation
weird convolutions to reduce # of passes over the data
Books / articles / research paper suggestions welcome. (I'm a CS PhD student).
Thanks.

Some ideas:
In terms of storing the compressed data, I don't think you will be able to do much better than your OS's own file system caching. Just make sure it is configured to use 11GB+ of RAM for file system caching and it should pull your whole compressed data set into memory as it is read the first time.
You should then be able to define your Clojure code to pull in the data lazily via a decompressing input stream, which will perform the decompression for you. (A ZipInputStream would do this for zip archives; since your data is bz2, you would need a bzip2 stream instead, such as BZip2CompressorInputStream from Apache Commons Compress.)
If you need to perform a second pass on the data, just create a new input stream on the same file. OS level caching should ensure that you don't hit the disk again.
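To make the streaming idea concrete, here is a rough sketch in Python rather than Clojure (a generator plays the role of a lazy sequence). The function name iter_ticks and the comma-separated tick format are assumptions made for the example, not anything from the question.

```python
import bz2
import os
import tempfile

def iter_ticks(path):
    """Lazily yield decompressed lines; data is decompressed only as consumed."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")

# Demonstration: a tiny compressed file standing in for the 9 GB one.
with tempfile.NamedTemporaryFile(suffix=".bz2", delete=False) as tmp:
    demo_path = tmp.name
with bz2.open(demo_path, "wt", encoding="utf-8") as f:
    f.write("AAPL,100.0\nAAPL,100.5\nMSFT,30.2\n")

# Each pass simply opens a new stream on the same file.
first_pass = sum(1 for _ in iter_ticks(demo_path))
second_pass = [row.split(",")[0] for row in iter_ticks(demo_path)]
os.remove(demo_path)
```

Nothing decompressed is written to disk, and the compressed bytes are served from the OS cache after the first pass if enough RAM is available.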

I have heard of systems like that implemented in Java. It is possible. You'll certainly want to understand how to create your own lazy sequences in order to accomplish this. I also wouldn't hesitate to drop down into Java if you need to make sure that you're dealing with the primitive types that you want to deal with. For example, Clojure won't generate code to do math on 32-bit ints; it will only generate code to work with longs, and if you don't want that it could be a pain.
It would also be worth some effort to make your in-memory format compatible with a disk format. That would give you the option of memory mapping files, or (at the very least) make your startup easy if your program were to crash. For example, it could just read the files on disk to recover its previous state.
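One way to picture an in-memory format that doubles as a disk format is a file of fixed-size binary records that can be memory mapped and decoded in place. The sketch below is Python, and the record layout (64-bit timestamp, 64-bit float price, 32-bit size) is purely a hypothetical example.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical tick layout: timestamp (int64), price (float64), size (int32).
TICK = struct.Struct("<qdi")

def write_ticks(path, ticks):
    """Write each tick as one fixed-size binary record."""
    with open(path, "wb") as f:
        for tick in ticks:
            f.write(TICK.pack(*tick))

def read_tick(mapped, i):
    """Decode record i directly from the mapped bytes."""
    return TICK.unpack_from(mapped, i * TICK.size)

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name
write_ticks(demo_path, [(1, 10.5, 100), (2, 10.75, 200), (3, 11.0, 50)])

with open(demo_path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    record_count = len(mapped) // TICK.size
    second = read_tick(mapped, 1)
    mapped.close()
os.remove(demo_path)
```

Because every record has the same size, record i always lives at byte offset i * TICK.size, both on disk and in the mapping.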

What reliability guarantees are provided by NTFS?

I wonder what kind of reliability guarantees NTFS provides about the data stored on it? For example, suppose I'm opening a file, appending to the end, then closing it, and the power goes out at a random time during this operation. Could I find the file completely corrupted?
I'm asking because I just had a system lock-up and found two of the files that were being appended to completely zeroed out. That is, of the right size, but made entirely of the zero byte. I thought this isn't supposed to happen on NTFS, even when things fail.

NTFS is a transactional file system, so it guarantees integrity - but only for the metadata (MFT), not the (file) content.

The short answer is that NTFS does metadata journaling, which assures valid metadata.
Other modifications (to the body of a file) are not journaled, so they're not guaranteed.
There are file systems that do journaling of all writes (e.g., AIX has one, if memory serves), but with them, you tend to get a tradeoff between disk utilization and write speed. IOW, you need a lot of "free" space to get decent performance -- they basically just do all writes to free space, and link that new data into the right spots in the file. Then they go through and clean out the garbage (i.e., free up parts that have since been overwritten, and usually coalesce the pieces of a file together as well). This can get slow if they have to do it very often though.
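The "write to free space, relink, then clean up" scheme can be pictured with a toy in-memory model. This is only a sketch of the idea in Python, not how any particular file system is implemented; the class and method names are made up for the example.

```python
class LogStructuredFile:
    """Toy model: every write goes to fresh space; a map links blocks into the file."""

    def __init__(self):
        self.storage = []     # append-only list of block contents ("the disk")
        self.block_map = {}   # logical block number -> index into storage

    def write(self, logical_block, data):
        self.storage.append(data)   # never overwrite in place
        self.block_map[logical_block] = len(self.storage) - 1

    def read(self, logical_block):
        return self.storage[self.block_map[logical_block]]

    def garbage_blocks(self):
        """Number of stored blocks that no logical block points to any more."""
        return len(self.storage) - len(self.block_map)

    def collect_garbage(self):
        """Drop overwritten blocks and lay the live ones out contiguously."""
        live = sorted(self.block_map)
        self.storage = [self.read(b) for b in live]
        self.block_map = {b: i for i, b in enumerate(live)}

store = LogStructuredFile()
store.write(0, b"old-0")
store.write(1, b"data-1")
store.write(0, b"new-0")            # the first copy of block 0 becomes garbage
garbage_before = store.garbage_blocks()
store.collect_garbage()
garbage_after = store.garbage_blocks()
current_block0 = store.read(0)
```

The model shows where the tradeoff comes from: writes are cheap appends, but stale copies pile up until the cleaning pass runs.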

How is fseek() implemented in the filesystem?

This is not a pure programming question, however it impacts the performance of programs using fseek(), hence it is important to know how it works. A little disclaimer so that it doesn't get closed.
I am wondering how efficient it is to insert data in the middle of a file. Suppose I have a file with 1MB of data and I insert something at the 512KB offset. How efficient would that be compared to appending my data at the end of the file? Just to make the example complete, let's say I want to insert 16KB of data.
I understand the answer varies depending on the filesystem, however I assume that the techniques used in common filesystems are quite similar and I just want to get the right notion of it.

(disclaimer: I want just to add some hints to this interesting discussion)
IMHO there are some things to take into account:
1) fseek is not a primary system service, but a library function. To evaluate its performance we must consider how the file stream library is implemented. In general, the file I/O library adds a layer of buffering in user space, so the performance of fseek may be quite different depending on whether the target position is inside or outside the current buffer. Also, the system services that the I/O library uses may vary a lot. For example, on some systems the library makes extensive use of memory-mapped files where possible.
2) As you said, different filesystems may behave in very different ways. In particular, I would expect that a transactional filesystem must do something very smart, and perhaps expensive, to be prepared for a possible rollback of an aborted write operation in the middle of a file.
3) Modern OSes have very aggressive caching algorithms. An "fseeked" file is likely to be already present in the cache, so operations become much faster. But they may degrade a lot if the overall filesystem activity produced by other processes becomes significant.
Any comments?

fseek(...) is a library call, not an OS system call. The run-time library takes care of the overhead involved in making system calls to the OS, so fseek only reaches the system indirectly (which illustrates the distinction between a library call and a system call). fseek(...) is a standard input/output function regardless of the underlying system. However, and this is a big however...
The OS will more than likely have cached the file in kernel memory, or at least the mapping from file offsets to the locations on disk where the 1's and 0's are stored. Most likely a top-most layer within the kernel holds this snapshot of what the file is composed of, irrespective of what the data contains (it does not care either way, as long as the 'pointers' into the on-disk structure for a given offset are valid).
When fseek(...) occurs, there can be a lot of indirect overhead: the kernel is delegated the task of reading from the disk and, depending on how fragmented the file is, the data could theoretically be "all over the place". From a user-land perspective (i.e. the C code doing an fseek(...)), the kernel may be gathering data scattered across the disk into one contiguous view. Inserting into the middle of a file, where the kernel has to adjust the locations/offsets on the actual disk platter for the data, would therefore be slower than appending to the end of the file.
The reason is quite simple: for an append, the kernel "knows" what the last offset was, so it simply wipes the EOF marker and adds more data. Behind the scenes, the kernel allocates another block of memory for the disk buffer, with the adjusted offset to the location on disk following the EOF marker, once the appending of data is completed.

Let us assume the ext2 FS and the Linux OS as an example. I don't think there will be a significant performance difference between an insert and an append. In both cases the file's inode and offset table must be read, the relevant disk sector mapped into memory, the data updated and, at some later point, the data written back to disk. What will make a big performance difference in this example is good temporal and spatial locality when accessing parts of the file, since this will reduce the number of load/store combos.
As a previous answer says, you may be able to speed up both operations if you deal with data writes that are exact multiples of the FS block size. In this case you could skip the load stage and just insert the new blocks into the file's inode data structure. This would not be practical, as you would need low level access to the FS driver, and using it would be very restrictive and not portable.

One observation I have made about fseek on Solaris is that each call to it resets the read buffer of the FILE. The next read will then always read a full block (8K by default). So if you have a lot of random access with small reads, it's a good idea to do it unbuffered (setvbuf with a NULL buffer and _IONBF mode) or even use direct syscalls (lseek+read or, even better, pread, which is only 1 syscall instead of 2). I suppose this behaviour will be similar on other OSes.
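The two alternatives mentioned here can be sketched in Python, which exposes the same primitives: an unbuffered file object (seek followed by read) and os.pread (a single positioned read, available on Unix-like systems only). The helper names are hypothetical.

```python
import os
import tempfile

def read_at_unbuffered(path, offset, size):
    """Seek, then read, on an unbuffered file object: no read-ahead buffer to reset."""
    with open(path, "rb", buffering=0) as f:
        f.seek(offset)
        return f.read(size)

def read_at_pread(fd, offset, size):
    """Positioned read: one call, and the descriptor's file offset is not moved."""
    return os.pread(fd, size, offset)

with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"0123456789abcdef")
    demo_path = tmp.name

via_seek = read_at_unbuffered(demo_path, 10, 4)

fd = os.open(demo_path, os.O_RDONLY)
try:
    via_pread = read_at_pread(fd, 10, 4)
    offset_after = os.lseek(fd, 0, os.SEEK_CUR)  # where the descriptor points now
finally:
    os.close(fd)
os.remove(demo_path)
```

Because pread takes the offset as an argument, it is also convenient when several threads read from the same descriptor.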

You can insert data into the middle of a file efficiently only if the data size is a multiple of the FS sector size, but OSes don't provide such functions, so you would have to use a low-level interface to the FS driver.

Inserting data in the middle of the file is less efficient than appending to the end because when inserting you would have to move the data after the insertion point to make room for the data being inserted. Moving these data would involve reading them from disk, writing the data to be inserted and then writing the old data after the inserted data. So you have at least one extra read and write when inserting.
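A small sketch of what that means at the application level, in Python. The helper names insert_bytes and append_bytes are hypothetical, and for simplicity the sketch reads the whole tail into memory, which a real implementation for large files would do in chunks.

```python
import os
import tempfile

def insert_bytes(path, offset, data):
    """Insert data at offset by rewriting everything that follows it."""
    with open(path, "r+b") as f:
        f.seek(offset)
        tail = f.read()      # extra read: everything after the insertion point
        f.seek(offset)
        f.write(data)        # the new data
        f.write(tail)        # extra write: the old tail, shifted forward

def append_bytes(path, data):
    """Append only writes the new data; nothing existing has to move."""
    with open(path, "ab") as f:
        f.write(data)

with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"HELLOWORLD")
    demo_path = tmp.name

insert_bytes(demo_path, 5, b", ")
append_bytes(demo_path, b"!")

with open(demo_path, "rb") as f:
    contents = f.read()
os.remove(demo_path)
```

The cost of the insert grows with the amount of data after the insertion point, while the cost of the append depends only on the size of the new data.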

Optimizing locations of on-disk data for sequential access

I need to store large amounts of data on-disk in approximately 1k blocks. I will be accessing these objects in a way that is hard to predict, but where patterns probably exist.
Is there an algorithm or heuristic I can use that will rearrange the objects on disk based on my access patterns to try to maximize sequential access, and thus minimize disk seek time?

On modern OSes (Windows, Linux, etc) there is absolutely nothing you can do to optimise seek times! Here's why:
You are in a pre-emptive multitasking system. Your application and all its data can be flushed to disk at any time: the user switches task, the screen saver kicks in, the battery runs out of charge, etc.
You cannot guarantee that the file is contiguous on disk. Doing Aaron's first bullet point will not ensure an unfragmented file. When you start writing the file, the OS doesn't know how big the file is going to be so it could put it in a small space, fragmenting it as you write more data to it.
Memory mapping the file only works as long as the file size is less than the available address range in your application. On Win32, the amount of address space available is about 2 GB minus the memory used by the application. Mapping larger files usually involves un-mapping and re-mapping portions of the file, which won't be the best of things to do.
Putting data in the centre of the file is no help as, for all you know, the central portion of the file could be the most fragmented bit.
To paraphrase Raymond Chen, if you have to ask about OS limits, you're probably doing something wrong. Treat your filesystem as an immutable black box; it just is what it is (I know, you can use RAID and so on to help).
The first step you must take (and must be taken whenever you're optimising) is to measure what you've currently got. Never assume anything. Verify everything with hard data.
From your post, it sounds like you haven't actually written any code yet, or, if you have, there is no performance problem at the moment.
The only real solution is to look at the bigger picture and develop methods to get data off the disk without stalling the application. This would usually be through asynchronous access and speculative loading. If your application is always accessing the disk and doing work with small subsets of the data, you may want to consider reorganising the data to put all the useful stuff in one place and the other data elsewhere. Without knowing the full problem domain it's not possible to be really helpful.
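As a rough illustration of asynchronous access with speculative loading, here is a Python sketch that loads blocks on a background thread and guesses that the next block will be wanted soon. The class name PrefetchingReader and the "next block" guess are assumptions for the example; a real application would plug in whatever prediction fits its access pattern.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

class PrefetchingReader:
    """Loads blocks on background threads and speculatively fetches the next one."""

    def __init__(self, path, block_size=1024):
        self.path = path
        self.block_size = block_size
        self.pool = ThreadPoolExecutor(max_workers=2)
        self.pending = {}            # block number -> Future holding its bytes

    def _load(self, n):
        with open(self.path, "rb") as f:
            f.seek(n * self.block_size)
            return f.read(self.block_size)

    def prefetch(self, n):
        """Start loading block n in the background if not already requested."""
        if n not in self.pending:
            self.pending[n] = self.pool.submit(self._load, n)

    def get(self, n):
        """Return block n, waiting only if its load has not finished yet."""
        self.prefetch(n)
        data = self.pending.pop(n).result()
        self.prefetch(n + 1)         # speculative: guess the next block follows
        return data

    def close(self):
        self.pool.shutdown(wait=True)

with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"AAAABBBBCCCC")
    demo_path = tmp.name

reader = PrefetchingReader(demo_path, block_size=4)
first = reader.get(0)                # also schedules block 1 in the background
second = reader.get(1)               # its load was already started by the call above
prefetched = sorted(reader.pending)
reader.close()
os.remove(demo_path)
```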

Depending on what you mean by "hard to predict", I can think of a few options:
If you always seek based on the same block field/property, store the records on disk sorted by that field. This lets you use binary search for O(log n) efficiency.
If you seek on different block fields, consider storing an external index for each field. A b-tree gives you O(log n) efficiency. When you seek, grab the appropriate index, search it for your block's data file address and jump to it.
Better yet, if your blocks are homogeneous, consider breaking them down into database records. A database gives you optimized storage, indexing, and the ability to perform advanced queries for free.
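The first option (records sorted by key, located by binary search) might look like this minimal Python sketch. The 1024-byte record layout with an 8-byte integer key is a hypothetical example chosen to match the "approximately 1k blocks" in the question.

```python
import os
import struct
import tempfile

# Hypothetical layout: 8-byte integer key + 1016-byte payload = 1024 bytes.
RECORD = struct.Struct("<q1016s")

def find_record(f, key):
    """Binary search a file of fixed-size records sorted by key: O(log n) seeks."""
    f.seek(0, os.SEEK_END)
    lo, hi = 0, f.tell() // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * RECORD.size)
        mid_key, payload = RECORD.unpack(f.read(RECORD.size))
        if mid_key == key:
            return payload.rstrip(b"\0")   # strip the zero padding
        if mid_key < key:
            lo = mid + 1
        else:
            hi = mid
    return None

with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    for k in (10, 20, 30, 40):             # records must be written in key order
        tmp.write(RECORD.pack(k, b"rec%d" % k))
    demo_path = tmp.name

with open(demo_path, "rb") as f:
    found = find_record(f, 30)
    missing = find_record(f, 25)
os.remove(demo_path)
```

With a million records this touches about 20 blocks per lookup; an external b-tree index or a database reduces that further by keeping the upper levels of the search in memory.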

Use memory-mapped file access rather than the usual open-seek-read/write pattern. This technique works on Windows and Unix platforms.
In this way the operating system's virtual memory system will handle the caching for you. Accesses of blocks that are already in memory will result in no disk seek or read time. Writes from memory back to disk are handled automatically and efficiently and without blocking your application.
Aaron's notes are good too as they will affect initial-load time for a chunk that's not in memory. Combine that with the memory-mapped technique -- after all it's easier to reorder chunks using memcpy() than by reading/writing from disk and attempting swapouts etc.
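A minimal sketch of memory-mapped access, using Python's mmap module (which wraps the native mapping APIs on both Windows and Unix). The 1 KB block size follows the question; the file contents are made up for the demonstration.

```python
import mmap
import os
import tempfile

BLOCK = 1024

with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(bytes(BLOCK * 4))                  # four zero-filled 1 KB blocks
    demo_path = tmp.name

with open(demo_path, "r+b") as f:
    mapped = mmap.mmap(f.fileno(), 0)            # map the whole file read/write
    mapped[2 * BLOCK:2 * BLOCK + 5] = b"hello"   # write into block 2 via memory
    mapped.move(0, 2 * BLOCK, BLOCK)             # copy block 2 over block 0
    first_block = mapped[:5]
    mapped.flush()                               # ask the OS to write changes back
    mapped.close()

with open(demo_path, "rb") as f:
    on_disk = f.read(5)
os.remove(demo_path)
```

Reordering chunks is just a memory copy here; the OS decides when the dirty pages actually reach the disk.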

The simplest way to solve this is to use an OS which solves that for you under the hood, like Linux. Give it enough RAM to hold 10% of the objects in RAM and it will try to keep as many of them in the cache as possible, reducing the load time to 0. The recent server versions of Windows might work, too (some of them didn't for me, that's why I'm mentioning this).
If this is a no go, try this algorithm:
Create a very big file on the hard disk. It is very important that you write this in one go so the OS will allocate a contiguous space on disk.
Write all your objects into that file. Make sure that each object is the same size (or give each the same space in the file and note the length in the first few bytes of each chunk). Use an empty hard disk or a disk which has just been defragmented.
In a data structure, keep the offsets of each data chunk and how often it is accessed. When it is accessed very often, swap its position in the file with a chunk that is closer to the start of the file and which has a lesser access count.
[EDIT] Access this file with the memory-mapped API of your OS to allow the OS to effectively cache the most used parts to get best performance until you can optimize the file layout next time.
Over time, heavily accessed chunks will bubble to the top. Note that you can collect the access patterns over some time, analyze them and do the reorder overnight when there is little load on your machine. Or you can do the reorder on a completely different machine and swap the file (and the offset table) when that's done.
That said, you should really rely on a modern OS where a lot of clever people have thought long and hard to solve these issues for you.
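The bookkeeping for the reordering algorithm above can be sketched as follows. This Python model only tracks the offset table and access counts; the class name ChunkLayout is hypothetical, and actually swapping the bytes of the two slots in the file is left to the caller.

```python
class ChunkLayout:
    """Tracks which slot each chunk occupies and how often it is accessed."""

    def __init__(self, chunk_ids):
        self.chunk_at = list(chunk_ids)                             # slot -> chunk
        self.slot_of = {cid: i for i, cid in enumerate(chunk_ids)}  # chunk -> slot
        self.hits = {cid: 0 for cid in chunk_ids}

    def access(self, cid):
        """Record an access and return the slot (file offset = slot * chunk size)."""
        self.hits[cid] += 1
        return self.slot_of[cid]

    def promote(self, cid):
        """Swap cid with the earliest-placed chunk that has a lower access count."""
        slot = self.slot_of[cid]
        for other_slot in range(slot):
            other = self.chunk_at[other_slot]
            if self.hits[other] < self.hits[cid]:
                self.chunk_at[slot], self.chunk_at[other_slot] = other, cid
                self.slot_of[cid], self.slot_of[other] = other_slot, slot
                return (slot, other_slot)   # slots whose bytes must be swapped on disk
        return None

layout = ChunkLayout(["a", "b", "c"])
for _ in range(3):
    layout.access("c")
layout.access("b")

swap_for_c = layout.promote("c")    # "c" is hot and sits last, so it moves forward
swap_for_b = layout.promote("b")    # nothing ahead of "b" is colder than it
final_order = list(layout.chunk_at)
```

Running promote in a batch (for example overnight, as suggested above) keeps the file stable while the application is busy.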

That's an interesting challenge. Unfortunately, I don't know how to solve this out of the box, either. Corbin's approach sounds reasonable to me.
Here's a little optimization suggestion, at least: place the most-accessed items at the center of your disk (or unfragmented file), not at the start or end. That way, seeks to lesser-used data will be shorter on average. Err, that's pretty obvious, though.
Please let us know if you figure out a solution yourself.
