Prepending to a multi-gigabyte file - performance

What would be the most performant way to prepend a single character to a multi-gigabyte file (in my practical case, a 40GB file)?
There is no restriction on how this is implemented: it can be a tool, a shell script, a program in any programming language, and so on.

There is no really simple solution. There are no system calls to prepend data, only append or rewrite.
But depending on what you're doing with the file, you may get away with tricks.
If the file is used sequentially, you could make a named pipe, run cat onecharfile.txt bigfile > namedpipe, and then use "namedpipe" as the file. The same can be achieved with cat onecharfile.txt bigfile | program if your program takes its input from stdin.
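For illustration, the same streaming idea can be sketched in Ruby: write the extra character first, then copy the big file through without ever rewriting it on disk. The method name stream_with_prefix and its parameters are made up for this example; IO.copy_stream is the standard library call that does the chunked copy.

```ruby
# Minimal sketch: emit a prefix followed by the untouched contents of a file.
# Nothing is written back to disk; the consumer just sees prefix + file.
def stream_with_prefix(prefix, path, out = $stdout)
  out.binmode if out.respond_to?(:binmode) # avoid newline translation on Windows
  out.write(prefix)
  File.open(path, "rb") do |src|
    IO.copy_stream(src, out) # copies in chunks, so memory use stays small
  end
end

# Hypothetical usage from a script whose stdout is piped into the real consumer:
#   stream_with_prefix("C", "bigfile")
```

Like the cat pipeline, this only helps if the consumer reads its input sequentially.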
For random access a FUSE filesystem could do it, but that is probably way too complicated for this.
If you want to get your hands really dirty, figure out how to:
allocate a datablock (about inode and datablock structure)
insert it into a file's chain as second block (or first and then you're practically done)
write the beginning of file into that block
write the single character as first in file
mark first block as if it uses only one byte of available payload (this is possible for last block, I don't know if it's possible for blocks in middle of file chain).
This could seriously wreck your filesystem though, so it is not recommended, however much fun it might be.

Let the file have an initial block of null characters. When you prepend a character, read the block, insert the character right-to-left, and write the block back. When the block is full, do the more expensive full rewrite in order to prepend another null block. That way, you reduce by a large factor the number of times you have to do a full rewrite.
Added: Keep the file in two subfiles: A (a short one) and B (a long one). Prepend to A any way you like. When A gets "big enough", prepend A to B (by re-writing), and clear A.
Another way: Keep the file as a directory of small files ..., A000003, A000002, A000001.
Just prepend to the largest-numbered file. When it's big enough, make the next file in sequence.
When you need to read the file, just read them all in descending order.
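The null-block idea above can be sketched roughly like this. The names SLACK_SIZE, slack_free, slack_grow, slack_prepend and slack_read are invented for the example, the temp-file name is hypothetical, and the sketch assumes the real data never starts with a null byte (otherwise the reader cannot tell padding from data).

```ruby
SLACK_SIZE = 4096 # assumed size of the reserved block of null bytes

# Count the unused null bytes at the very start of the file.
def slack_free(path)
  head = File.binread(path, SLACK_SIZE) || ""
  head.each_byte.take_while(&:zero?).size
end

# Slow path: rewrite the whole file once, with a fresh null block in front.
def slack_grow(path)
  tmp = "#{path}.tmp" # hypothetical temp name in the same directory
  File.open(tmp, "wb") do |out|
    out.write("\0" * SLACK_SIZE)
    File.open(path, "rb") { |src| IO.copy_stream(src, out) }
  end
  File.rename(tmp, path)
end

# Fast path: overwrite the last free null byte, filling the block right to left.
def slack_prepend(path, byte)
  unless byte.bytesize == 1 && byte != "\0"
    raise ArgumentError, "expected a single non-null byte"
  end
  slack_grow(path) if slack_free(path).zero?
  free = slack_free(path)
  File.open(path, "r+b") do |f|
    f.seek(free - 1)
    f.write(byte)
  end
end

# Logical contents: everything after the leading null bytes.
# Reads the whole file into memory, so this is only for small demos.
def slack_read(path)
  data = File.binread(path)
  data.byteslice(slack_free(path), data.bytesize)
end
```

With a 4096-byte block, only one prepend in 4096 pays for the full rewrite; every other prepend touches a single byte near the start of the file.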

You might be able to invert your implementation depending on your problem: append single characters to the end of your file. When it comes time to read the file, read it in reverse.
Hide this behind enough of an abstraction layer and it may not make a difference to your code how the bytes are physically stored.
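As a rough sketch of that inversion, assuming single-byte characters: "prepending" becomes an append, and a small reader walks the file from the end. The method names reversed_prepend and reversed_each_chunk are invented for the example.

```ruby
# "Prepend" by appending: the newest byte goes at the physical end of the file.
def reversed_prepend(path, byte)
  File.open(path, "ab") { |f| f.write(byte) }
end

# Yield the logical contents front to back by reading the file back to front.
def reversed_each_chunk(path, chunk_size = 1 << 20)
  File.open(path, "rb") do |f|
    pos = f.size
    while pos > 0
      len = [chunk_size, pos].min
      pos -= len
      f.seek(pos)
      yield f.read(len).reverse # reverses the bytes of this binary chunk
    end
  end
end
```

Multi-byte encodings such as UTF-8 would need extra care here, because reversing bytes scrambles multi-byte characters.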

If you use Linux, you could load a custom version of read(2) with LD_PRELOAD and have it prepend your data on the first read.
See https://zlibc.linux.lu/zlibc.html for implementation inspiration.

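If you mean prepending that character to the start of the entire file, one way is (using printf rather than echo, so that no newline is added after the character):
$ printf 'C' > tmp
$ cat my40gbfile >> tmp
$ mv tmp my40gbfile
or using sed, which also rewrites the whole file:
$ sed -i '1s/^/C/' my40gbfile
If you mean prepending the character to every line of the file: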
$ awk '{print "C"$0}' my40gbfile > temp && mv temp my40gbfile

As I understand it, this is handled at the file system level, meaning that prepending data to a file effectively rewrites the file. This is the same reason the ID3 tags in MP3 files are zero-padded: future updates can overwrite the reserved bytes instead of rewriting the entire file.
So whichever way you use will give roughly similar results. What you can try is testing a custom copy function that reads and writes in bigger chunks than the default system copy, say 2MB or 5MB, which might improve performance. Ultimately your disk I/O is the bottleneck here.
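A custom copy along those lines might look like the following sketch. The method name prepend_by_rewrite, the 4MB default and the ".tmp" file name are assumptions for the example, and whether a bigger chunk actually helps would have to be measured on the disk in question.

```ruby
# Rewrite the file with a prefix in front, copying in large chunks.
# Needs enough free space for a second full copy of the file.
def prepend_by_rewrite(path, prefix, chunk_size = 4 * 1024 * 1024)
  tmp = "#{path}.tmp" # hypothetical temp name on the same filesystem
  File.open(tmp, "wb") do |out|
    out.write(prefix)
    File.open(path, "rb") do |src|
      # read(n) returns nil at end of file, which ends the loop
      while (chunk = src.read(chunk_size))
        out.write(chunk)
      end
    end
  end
  File.rename(tmp, path) # replace the original once the copy is complete
end
```

Keeping the temp file on the same filesystem means the final rename does not copy the data again.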

The absolute highest-performance way would seem to be to get down to the level of sectors and how the file is actually stored. I'm not sure whether the OS then becomes a factor, but the target platform might, so it would be useful to know what you run on.
I think this is a case where C is the obvious choice; this kind of low-level work is exactly what a systems programming language is for.
Can you tell us what you end up doing? It would be interesting to know.

Here's the Windows command line ("DOS") way:
Put your 1 char into prepend.txt
copy /b prepend.txt + myHugeFile fileNameOfCombinedFile

Related

GNU split (UNIX command) creating files not matching pattern after reaching "z"

I was splitting some large files, and everything worked properly until an 81GB file came along. The split command seems to have done its job, but the last files have names that don't follow the pattern. Look at the bottom right of the picture.
And I'm using the command like this:
split -b 125M ./2014.txt 2014/2014_
Does anyone know why it created the file 2014_zaaa instead of 2014_za?
You can only have 676 files named [a-z][a-z], while your command required more.
Here are some options for what split could do:
Crash. This is the behavior mandated by POSIX, and followed by macOS.
Start writing larger suffixes. This is a bad choice because after _zz comes _aaa, but now the files will show up in the wrong order in ls and cat * will no longer join them in correct order.
Save the last range, _z, for longer suffixes. This is a good choice because after _yz comes _zaaa, which has room to grow while still remaining in alphabetical order. This is what GNU does, and the behavior you're seeing.
If you want all the names to be uniform without triggering any of these behaviors, just use a larger suffix length with -a 6 to ensure you have enough room.
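To pick a suffix length for -a without guessing, the arithmetic can be sketched as below. It assumes fixed-length suffixes drawn from the 26 lowercase letters, so a suffix of length n gives 26**n names; the method names are made up for the example.

```ruby
# Number of pieces split produces for a given file size and chunk size (ceiling division).
def split_file_count(total_bytes, chunk_bytes)
  (total_bytes + chunk_bytes - 1) / chunk_bytes
end

# Smallest suffix length whose name space (alphabet_size ** length) covers file_count.
def suffix_length_for(file_count, alphabet_size = 26)
  length = 1
  length += 1 while alphabet_size**length < file_count
  length
end

# Hypothetical usage for a file of total_bytes split into 125M (125 * 1024 * 1024 byte) pieces:
#   suffix_length_for(split_file_count(total_bytes, 125 * 1024 * 1024))
```

The -a 6 suggested above is far more than the minimum: it leaves room for 26**6 = 308,915,776 names.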

How to extract specific lines from a huge data file?

I have a very large data file, about 32GB. The file is made up of about 130k lines, each of which mainly contains numbers but also has a few characters.
The task I need to perform is very clear: I have to extract 20 lines and write them to a new text file.
I know the exact line number for each of the 20 lines that I want to copy.
So the question is: how can I extract the content at a specific line number from the large file? I am on Windows. Is there a tool that can do this sort of operation, or do I need to write some code?
If there is no direct way of doing that, a possible approach is to first extract small blocks of the original file (so that each block contains one or more of the lines to extract) and then use a standard editor to find the lines within each block. In that case, the question would be: how can I split a large file into blocks by line on Windows? I use a tool named HJ-Split, which works very well with large files, but it can only split by size, not by line.
Install[1] Babun Shell (or Cygwin, but I recommend Babun), and then use the sed command as described here: How can I extract a predetermined range of lines from a text file on Unix?
[1] Installing Babun means actually just unzipping it somewhere, so you don't have to have the Administrator rights on the server.
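If writing a little code is acceptable, the same extraction can be sketched in Ruby, which also runs on Windows. This is only an illustration: the method name extract_lines and its parameters are invented, and line numbers are assumed to be 1-based. The file is read one line at a time, so the 32GB never has to fit in memory.

```ruby
require "set"

# Copy the lines with the given 1-based numbers from in_path to out_path.
def extract_lines(in_path, out_path, line_numbers)
  wanted = line_numbers.to_set
  last = wanted.max || 0
  File.open(in_path, "rb") do |src|
    File.open(out_path, "wb") do |out|
      src.each_line.with_index(1) do |line, number|
        out.write(line) if wanted.include?(number)
        break if number >= last # stop reading once the last wanted line is done
      end
    end
  end
end

# Hypothetical usage:
#   extract_lines("huge.txt", "picked.txt", [12, 4711, 90210])
```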

Ruby - Delete the last character in a file?

Seems like it must be easy, but I just can't figure it out. How do you delete the very last character of a file using Ruby IO?
I took a look at the answer for deleting the last line of a file with Ruby but didn't fully understand it, and there must be a simpler way.
Any help?
There is File.truncate:
truncate(file_name, integer) → 0
Truncates the file file_name to be at most integer bytes long. Not available on all platforms.
So you can say things like:
File.truncate(file_name, File.size(file_name) - 1)
That should truncate the file with a single system call to adjust the file's size in the file system without copying anything.
Note the "not available on all platforms" caveat, though. File.truncate should be available on anything Unix-like (such as Linux or OS X); I can't say anything useful about Windows support.
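One thing to watch: truncating by one byte only removes one character if the last character is a single byte. A sketch that handles multi-byte UTF-8 characters as well, assuming the file is valid UTF-8 (chop_last_char is a made-up name), could look like this:

```ruby
# Remove the last character of a UTF-8 text file without copying the file.
def chop_last_char(path)
  size = File.size(path)
  return if size.zero?
  # A UTF-8 character is at most 4 bytes long, so the last 4 bytes are enough.
  tail = File.binread(path, [size, 4].min, [size - 4, 0].max)
  # Walk back over continuation bytes (bit pattern 10xxxxxx) to the lead byte.
  n = 1
  n += 1 while n < tail.bytesize && (tail.getbyte(-n) & 0xC0) == 0x80
  File.truncate(path, size - n)
end
```

For plain ASCII text this behaves exactly like truncating by one byte.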
I assume you are referring to a text file. The usual way of changing such is to read it, make the changes, then write a new file:
text = File.read(in_fname)
File.write(out_fname, text[0..-2])
Insert the name of the file you are reading from for in_fname and the name of the file you are writing to for out_fname. They can be the same file, but if that's the intent it's safer to write to a temporary file, copy the temporary file to the original file, and then delete the temporary file. That way, if something goes wrong before the operations are completed, you will probably still have either the original or the temporary file. text[0..-2] is a string consisting of all characters read except for the last one. You could alternatively do this:
File.write(out_fname, File.read(in_fname, File.stat(in_fname).size-1))

Is there an elegant way to parse a text file *backwards*? [duplicate]

In the course of working on my Ruby program, I had the Eureka Moment that it would be much simpler to write if I were able to parse the text files backwards, rather than forward.
It seems like it would be simple to simply read the text file, line by line, into an array, then write the lines backwards into a text file, parse this temp file forwards (which would now effectively be going backwards) make any necessary changes, re-catalog the resulting lines into an array, and write them backwards a second time, restoring the original direction, before saving the modifications as a new file.
While feasible in theory, I see several problems with it in practice, the biggest of which is that if the size of the text file is very large, a single array will not be able to hold the entirety of the document at once.
Is there a more elegant way to accomplish reading a text file backwards?
If you are not using lots of UTF-8 characters, you can use the Elif library, which works just like File.open. Just load Elif and replace File.open with Elif.open:
Elif.open('read.txt', "r").each_line{ |s|
puts s
}
This is a great library, but the only problem I am experiencing right now is that it has several problems with line endings in UTF-8. I now have to rethink how I iterate over my files.
Additional Details
While googling for a way to read UTF-8 files in reverse, I found one that is already implemented by the File library.
To read a file backward you can try the following code:
File.readlines('manga_search.test.txt').reverse_each{ |s|
puts s
}
This does a good job as well, though File.readlines loads the whole file into memory first.
There's no software limit on the size of a Ruby array. There are some memory limitations though: Array size too big - ruby
Your approach would work much faster if you can read everything into memory, operate there and write it back to disk. Assuming the file fits in memory of course.
Let's say your lines are 80 chars wide on average, and you want to read 100 lines. If you want it efficient (as opposed to implemented with the least amount of code), then go back 80*100 bytes from the end (using seek with the "relative to end" option), then read ONE line (this is likely a partial one, so throw it away). Remember your current position via tell, then read everything up until the end.
You now have either more or fewer than 100 lines in memory. If fewer, go back (100+1.5*no_of_missing_lines)*80, and repeat the above steps, but only reading lines until you reach your remembered position from before. Rinse and repeat.
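A simplified sketch of this seek-from-the-end approach follows. It is not exactly the procedure described: instead of remembering the previous position and reading only the missing part, it doubles the window and re-reads whenever the guess was too small. The name last_lines and the default guess of 80 bytes per line are assumptions.

```ruby
# Return the last `count` lines of a file without reading it from the start.
def last_lines(path, count, guess = 80)
  File.open(path, "rb") do |f|
    size = f.size
    want = [count * guess, 1].max
    loop do
      start = [size - want, 0].max
      f.seek(start) # same effect as seeking relative to the end
      lines = f.read.lines
      # Unless we started at byte 0, the first line is probably partial: drop it.
      lines.shift if start > 0
      return lines.last(count) if lines.size >= count || start.zero?
      want *= 2 # guess was too small: look further back and try again
    end
  end
end
```

Re-reading the tail is wasteful compared with the incremental scheme above, but it keeps the bookkeeping short.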
How about just going to the end of the file and iterating backwards over each char until you reach a newline, read the line, and so on? Not elegant, but certainly effective.
Example: https://gist.github.com/1117141
I can't think of an elegant way to do something so unusual as this, but you could probably do it using the file-tail library. It uses random access files in Ruby to read it backwards (and you could even do it yourself, look for random access at this link).
You could go throughout the file once forward, storing only the byte offset of each \n instead of storing the full string for each line. Then you traverse your offset array backward and can use ios.sysseek and ios.sysread to get lines out of the file. Unless your file is truly enormous, that should alleviate the memory issue.
Admittedly, this absolutely fails the elegance test.
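For what it's worth, the offset-index idea might be sketched like this. It uses the buffered seek and read rather than sysseek and sysread, and the method names line_offsets and each_line_reversed are made up for the example.

```ruby
# One forward pass: record the byte offset where each line starts.
# The final entry is the offset just past the last line (the file size).
def line_offsets(path)
  offsets = [0]
  File.open(path, "rb") do |f|
    f.each_line { offsets << f.pos }
  end
  offsets
end

# Walk the offsets backwards and yield each line, last line first.
def each_line_reversed(path)
  offsets = line_offsets(path)
  File.open(path, "rb") do |f|
    (offsets.size - 1).downto(1) do |i|
      f.seek(offsets[i - 1])
      yield f.read(offsets[i] - offsets[i - 1])
    end
  end
end
```

Memory use is one integer per line instead of one string per line.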

Can one move back up a text file in VB6?

I have a program that is reading a text file, and owing to the vagaries of the file definition and the definitions of the objects the data has to be shovelled into, I appear to need to move the read pointer back up the file by one line, in a manner roughly analogous to the FORTRAN BACKSPACE statement.
Is there any method of doing this, either with native VB6 statements or with VB6 FileSystem objects?
I'm pretty sure VB6 provides a seek() function to do this.
Otherwise, if the file is relatively small you could read it all into memory and use the split() function to separate it into lines. These could then be accessed however you want. Obviously if the file is large this is not a good idea though.
The FSO only lets you read forwards.
There isn't a way to do this in VB6. What you could do is read the whole file, a line at a time, into an array and then iterate through the array as needed. Or, if that causes memory issues, create a data structure and use Input to read a line into an instance of the structure based upon the line number.

Resources