True in-place file editing using GNU tools - performance

I have a very large (multiple gigabytes) file that I want to do simple operations on:
Add 5-10 lines at the end of the file.
Add 2-3 lines at the beginning of the file.
Delete a few lines in the beginning, up to a certain substring. Specifically, I need to traverse the file up to a line that says "delete me!\n" and then delete all lines in the file up to and including that line.
I'm struggling to find a tool that can do the editing in place, without creating a temporary file (a very slow operation) that is essentially a copy of my original file. Basically, I want to minimize the number of I/O operations against the disk.
Both sed -i and awk -i do exactly that slow thing (https://askubuntu.com/questions/20414/find-and-replace-text-within-a-file-using-commands) and are inefficient as a result. What's a better way?
I'm on Debian.

Adding 5-10 lines at the beginning of a multi-GB file will always require fully rewriting the contents of that file, unless you're using an OS and filesystem that provides nonstandard syscalls. (You can avoid needing multiple GB of temporary space by writing back to a point in the file you're modifying from which you've already read to a buffer, but you can't avoid needing to rewrite everything past the point of the edit).
This is because UNIX only permits adding new contents to a file in a manner that changes its overall size at or past its existing end. You can edit part of a file in-place -- that is to say, you can seek 1GB in and write 1MB of new contents -- but this changes the 1MB of contents that had previously been in that location; it doesn't change the total size of the file. Similarly, you can truncate and rewrite a file at a location of your choice, but everything past the point of truncation needs to be rewritten.
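For the deletion part of the question, that "write back to a point you've already read past" idea can be sketched with dd(1) and truncate(1). This is only an illustration, not a drop-in solution: it assumes the marker line occurs exactly once, and an interruption partway through leaves the file half-shifted, so test on a copy first.
# Sketch only: shift everything after the marker line down to offset 0 within
# the same file, then cut off the now-duplicated tail.
off=$(grep -b -m1 'delete me!' bigfile | cut -d: -f1)   # byte offset of the marker line
len=$(grep -m1 'delete me!' bigfile | wc -c)            # its length, newline included
dd if=bigfile of=bigfile bs=64K conv=notrunc \
   iflag=skip_bytes oflag=seek_bytes skip=$(( off + len )) seek=0
truncate -s -$(( off + len )) bigfile                   # shrink by the removed amount
This still rewrites everything after the edit point, but it needs no multi-gigabyte temporary file.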
Examples of the nonstandard operations referred to above are FALLOC_FL_INSERT_RANGE and FALLOC_FL_COLLAPSE_RANGE, which, with very new Linux kernels, allow blocks to be inserted into or removed from an existing file. This is unlikely to be helpful to you here:
Only whole blocks (i.e. 4 KB, or whatever your filesystem is formatted for) can be inserted or removed, not individual lines of text of arbitrary size.
Only XFS and ext4 are supported.
See the documentation for fallocate(2).
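If your filesystem does support it, the corresponding command-line interface is fallocate(1) from util-linux. A hedged sketch of collapsing the largest block-aligned prefix in front of the marker line (anything that isn't block-aligned still needs a conventional rewrite):
# Requires ext4/XFS, a recent kernel, and util-linux fallocate(1).
bs=$(stat -f -c %S bigfile)                              # filesystem block size
off=$(grep -b -m1 'delete me!' bigfile | cut -d: -f1)    # byte offset of the marker line
aligned=$(( off / bs * bs ))                             # largest multiple of the block size
fallocate --collapse-range --offset 0 --length "$aligned" bigfile
# The remaining partial block up to and including "delete me!" still has to be
# removed by rewriting what is left.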

Here is a recommendation for editing large files (adjust the line count and number of suffix digits based on your file length and the number of sections you want to work on):
split -l 1000 -a 4 -d bigfile bigfile_
For that you need space, since bigfile won't be removed.
Insert a header as the first line:
sed -i '1iheader' bigfile_0000
Search for a specific pattern, find out which piece it lives in, and remove the preceding pieces:
grep pattern bigfile_*
etc.
Once all the editing is done, just cat the remaining pieces back together:
cat bigfile_* > edited_bigfile
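Applied to the original question, the whole workflow might look something like this sketch (the trailing lines are illustrative, and it assumes the marker appears exactly once):
piece=$(grep -l 'delete me!' bigfile_* | head -n 1)    # piece containing the marker line
for f in bigfile_*; do                                 # drop every piece before it
    [[ "$f" < "$piece" ]] && rm -- "$f"
done
sed -i '0,/delete me!/d' "$piece"                      # GNU sed: trim up to and including the marker
printf '%s\n' 'new trailing line 1' 'new trailing line 2' >> "$(ls bigfile_* | tail -n 1)"
cat bigfile_* > edited_bigfile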

Related

Why is in-place editing of a file slower than making a new file?

As you can see in this answer, editing a text file in place seems to take much more time than creating a new file, deleting the old file, and moving a temporary file in from another file system and renaming it, let alone creating a new file in the same file system and just renaming it. I was wondering what the reason behind that is.
Because when you edit a file in place you are opening the same file for both reading and writing, whereas when you use another file you only read from one file and write to the other.
When you open a file for reading, its contents are moved from disk into memory. Then, when you edit the file, you change its contents on disk, so the contents you have in memory should be updated to prevent data inconsistency. But when you use a new file, you don't have to update the contents of the first file in memory: you just read the whole first file once and write the other file once, and never update anything. Removing a file also takes very little time, because you just remove it from the file system without writing any bits to the file's location on disk. The same goes for renaming. Moving can also be done very fast depending on the file system, but most likely not as fast as removing and renaming.
There is also another, more important reason.
When you remove the numbers from the beginning of the first line, all of the following characters have to be shifted back a little. Then when you remove the numbers from the second line, again all of the characters after that point have to be shifted back, because the characters have to be consecutive. If you only wanted to change some characters, editing in place would have been a lot faster. But since you are changing the length of the file on each removal, all of the other characters have to get shifted, and that takes a lot of time. It isn't exactly like this, and it is much more complicated depending on the implementation of your operating system and file system, but this is the idea behind it. It's like an array operation: when you remove a single element from an array you have to shift all of the following elements, because it is an array. In contrast, if you were to remove an element from a linked list you wouldn't need to shift other elements, but files are implemented more like arrays, so that is that.
While tgwtdt's answer may give a few good insights, it does not explain everything. Here is a counterexample on a 140 MB file:
$ time sed 's/a/b/g' data > newfile
real 0m2.612s
$ time sed -i -- 's/a/b/g' data
real 0m9.906s
Why is this a counterexample, you may ask? Because I replace a with b, which means that the replacement text has the same length. Thus no data needs to be moved, yet it still took about four times longer.
While tgwtdt gave a good explanation of why editing in place usually takes longer, this is a question that cannot be answered 100% for the general case, because it is implementation dependent.

How can you identify a file without a filename or filepath?

Suppose I give you a file. You can read the file but you can't change it or copy it. Then I take the file, rename it, and move it to a new location. How could you identify that file (fairly reliably)?
The situation I have in mind: I have a database of media files for a program, and the user alters the location or name of a file. Could I find the file again by searching a directory and looking for something?
I have done exactly this; it's not hard.
I take a 256-bit hash (I forget off the top of my head which routine I used) of the file, plus the file size, and write them to a table. If they match, the files match. (And I think tracking the size is more paranoia than necessity.) To speed things up I also fold that hash to a 32-bit value; only if the 32-bit values match do I check all the data.
For the sake of performance I persist the last 10 million files I have examined. The 32-bit values go in one file which is read in its entirety; when a main record needs to be examined I pull in a "page" (I forget exactly how big) of them, which is padded to align it with the disk.
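A minimal sketch of the same idea with standard tools (the paths are illustrative, and file names containing newlines would need extra care):
# Build a "hash size path" table for the library once:
find /media/library -type f -printf '%s %p\n' |
while read -r size path; do
    printf '%s %s %s\n' "$(sha256sum "$path" | cut -d' ' -f1)" "$size" "$path"
done > media.index
# Identify a renamed/moved file by recomputing its hash (and size, if paranoid):
h=$(sha256sum '/new/place/mystery.bin' | cut -d' ' -f1)
grep "^$h " media.index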

Embarrassingly parallel workflow creates too many output files

On a Linux cluster I run many (N > 10^6) independent computations. Each computation takes only a few minutes and the output is a handful of lines. When N was small I was able to store each result in a separate file to be parsed later. With large N however, I find that I am wasting storage space (for the file creation) and simple commands like ls require extra care due to internal limits of bash: -bash: /bin/ls: Argument list too long.
Each computation is required to run through a qsub scheduling algorithm, so I am unable to create a master program which simply aggregates the output data into a single file. The simple solution of appending to a single file fails when two programs finish at the same time and interleave their output. I have no admin access to the cluster, so installing a system-wide database is not an option.
How can I collate the output data from embarrassingly parallel computation before it gets unmanageable?
1) As you say, it's not ls which is failing; it's the shell which does glob expansion before starting up ls. You can fix that problem easily enough by using something like
find . -type f -name 'GLOB' | xargs UTILITY
eg.:
find . -type f -name '*.dat' | xargs ls -l
You might want to sort the output, since find (for efficiency) doesn't sort the filenames (usually). There are many other options to find (like setting directory recursion depth, filtering in more complicated ways, etc.) and to xargs (maximum number of arguments for each invocation, parallel execution, etc.). Read the man pages for details.
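As a concrete variant that also sorts the names and tolerates whitespace in file names (GNU find/sort/xargs assumed):
find . -type f -name '*.dat' -print0 | sort -z | xargs -0 ls -l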
2) I don't know how you are creating the individual files, so it's a bit hard to provide specific solutions, but here are a couple of ideas:
If you get to create the files yourself, and you can delay the file creation until the end of the job (say, by buffering output), and the files are stored on a filesystem which supports advisory locking or some other locking mechanism like atomic linking, then you can multiplex various jobs into a single file by locking it before spewing the output, and then unlocking (see the sketch after this list of ideas). But that's a lot of requirements. In a cluster you might well be able to do that with a single file for all the jobs running on a single host, but then again you might not.
Again, if you get to create the files yourself, you can atomically write each line to a shared file. (Even NFS supports atomic writes but it doesn't support atomic append, see below.) You'd need to prepend a unique job identifier to each line so that you can demultiplex it. However, this won't work if you're using some automatic mechanism such as "my job writes to stdout and then the scheduling framework copies it to a file", which is sadly common. (In essence, this suggestion is pretty similar to the MapReduce strategy. Maybe that's available to you?)
Failing everything else, maybe you can just use sub-directories. A few thousand directories of a thousand files each is a lot more manageable than a single directory with a few million files.
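A rough sketch of the locking idea with flock(1) from util-linux (file names are illustrative, and this assumes advisory locking actually works on the filesystem in question):
# Append this job's buffered output under an exclusive advisory lock.
{
    flock -x 9                                  # block until we hold the lock
    cat job_output.tmp >> combined_results.txt
} 9>> combined_results.lock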
Good luck.
Edit As requested, some more details on 2.2:
You need to use POSIX I/O functions for this because, afaik, the C library (stdio) does not provide atomic writes. In POSIX, the write function always writes atomically, provided that you specify O_APPEND when you open the file. (Actually, it writes atomically in any case, but if you don't specify O_APPEND then each process retains its own position in the file, so they will end up overwriting each other.)
So what you need to do is:
At the beginning of the program, open a file with the flags O_WRONLY|O_CREAT|O_APPEND. (Contrary to what I said earlier, this is not guaranteed to work on NFS, because NFS may not handle O_APPEND properly. Newer versions of NFS could theoretically handle append-only files, but they probably don't. Some thoughts about this a bit later.) You probably don't want to always use the same file, so put a random number somewhere into its name so that your various jobs have a variety of alternatives. O_CREAT is always atomic, afaik, even with crappy NFS implementations.
For each output line, sprintf the line to an internal buffer, putting a unique id at the beginning. (Your job must have some sort of unique id; just use that.) [If you're paranoid, start the line with some kind of record separator, followed by the number of bytes in the remaining line -- you'll have to put this value in after formatting -- so the line will look something like ^0274:xx3A7B29992A04:<274 bytes>\n, where ^ is hex 01 or some such.]
write the entire line to the file with a single write call. Check the return code and the number of bytes written. If the write fails, try again. If the write was short, then hopefully you followed the "if you're paranoid" instructions above, and you can just try again.
Really, you shouldn't get short writes, but you never know. Writing the length is pretty simple; demultiplexing is a bit more complicated, but you could cross that bridge when you need to :)
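For completeness, a rough shell-level approximation of the same idea (not the C-level write(2) loop described above): the shell's >> redirection opens the file with O_APPEND, and a single short printf from one process normally arrives as one write, so whole lines from different jobs shouldn't interleave. JOB_ID, result_line and the file name below are placeholders.
# Pick the output file once per job, as suggested above (placeholder names).
outfile="results_$(( JOB_ID % 16 )).log"
printf '%s:%s\n' "$JOB_ID" "$result_line" >> "$outfile"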
The problem with using NFS is a bit more annoying. As with 2.1, the simplest solution is to try to write the file locally, or use some cluster filesystem which properly supports append. (NFSv4 allows you to ask for only "append" permissions and not "write" permissions, which would cause the server to reject the write if some other process already managed to write to the offset you were about to use. In that case, you'd need to seek to the end of the file and try the write again, until eventually it succeeds. However, I have the impression that this feature is not actually implemented. I could be wrong.)
If the filesystem doesn't support append, you'll have another option: decide on a line length, and always write that number of bytes. (Obviously, it's easier if the selected fixed line length is longer than the longest possible line, but it's possible to write multiple fixed-length lines as long as they have a sequence number.) You'll need to guarantee that each job writes at different offsets, which you can do by dividing the job's job number into a file number and an interleave number, and write all the lines for a particular job at its interleave modulo the number of interleaves, into a file whose name includes the file number. (This is easiest if the jobs are numbered sequentially.) It's OK to write beyond the end of the file, since unix filesystems will -- or at least, should -- either insert NULs or create discontiguous files (which waste less space, but depend on the blocksize of the file).
Another way to handle filesystems which don't support append but do support advisory byte-range locking (NFSv4 supports this) is to use the fixed-line-length idea, as above, but obtaining a lock on the range about to be written before writing it. Use a non-blocking lock, and if the lock cannot be obtained, try again at the next line-offset multiple. If the lock can be obtained, read the file at that offset to verify that it doesn't have data before writing it; then release the lock.
Hope that helps.
If you are only concerned about space:
parallel --header : --tag computation {foo} {bar} {baz} ::: foo 1 2 ::: bar I II ::: baz . .. | pbzip2 > out.bz2
or shorter:
parallel --tag computation ::: 1 2 ::: I II ::: . .. | pbzip2 > out.bz2
GNU Parallel ensures output is not mixed.
If you are concerned with finding a subset of the results, then look at --results.
Watch the intro videos to learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Another possibility would be to use N files, with N greater than or equal to the number of nodes in the cluster, and assign the files to your computations in a round-robin fashion. This should avoid concurrent writes to any of the files, provided you have a reasonable guarantee on the order of execution of your computations.
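A sketch of that assignment, assuming the scheduler exports a numeric task id (for example $PBS_ARRAYID under Torque or $SGE_TASK_ID under Grid Engine; adjust for your scheduler):
NFILES=64                                              # >= number of cluster nodes
./computation "$PBS_ARRAYID" >> "results_$(( PBS_ARRAYID % NFILES )).out"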

fast deletion of a line with an index from a file

I have a HUGE file of 10G. I want to remove line 188888 from this file.
I use sed as follows:
sed -i '188888d' file
The problem is that it is really slow. I understand it is because of the size of the file, but is there any way I can do it faster?
Thanks
Try
sed -i '188888{;d;q;}' file
You may need to experiment with which of the above semicolons you keep, with {d;q} being the second thing to try.
This will stop searching the file after it deletes that one line, but you'll still have to spend the time re-writing the file. It would also be worth testing
sed '188888{;q;d;}' file > /path/to/alternate/mountpoint/newFile
where the alternate mountpoint is on a separate disk drive.
final edit
Ah, one other option would be to edit the file while it is being written through a pipe
yourLogFileProducingProgram | sed '188888d' > logFile
But this assumes that you know that the data you want to delete is always at line 188888; is that possible?
I hope this helps.
File line numbers are determined by counting \n characters, so if line lengths are variable you cannot compute the offset of a given line directly; you have to count the newlines.
This will always be O(n), where n is the number of bytes in the file.
Parallel algorithms do not help either, because this operation is disk-I/O limited; divide and conquer will be even slower.
If you will do this a lot on the same file, there are ways to preprocess the file and make the lookups faster.
An easy way is to build an index with one entry per line:
line#:offset
When you want to find a line, do a binary search (O(log n)) in the index for the line number you want, and use the offset to locate the line in the original file.
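A hedged sketch of that index with standard tools (LC_ALL=C so that awk's length() counts bytes; the lookup shown is a plain scan, whereas a real binary search would want a sorted, fixed-width index):
# Build a "line#:byte-offset" index once:
LC_ALL=C awk 'BEGIN { off = 0 } { print NR ":" off; off += length($0) + 1 }' file > file.idx
# Look up line 188888 and read it straight from its offset:
off=$(awk -F: '$1 == 188888 { print $2; exit }' file.idx)
tail -c +$(( off + 1 )) file | head -n 1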

compare 2 files and copy source if different from destination - vbscript?

I'm working on Windows XP and I need to make a script that would compare 2 files (1 on a server and 1 on a client). Basically, I need my script to check if the file from the client is different from the server version and replace the client version if it finds a difference (in the file itself, not only the modification date).
As you suggest, you can skip the date check as that can be changed without the contents changing.
First check whether the sizes are different. If they are, that may be enough to conclude that the files are different. Depending on the types of files this can give false positives, though. For example, a Unicode text file may contain exactly the same content as an ANSI text file but be encoded with two bytes per character; if it's a script, it would execute with exactly the same results, but be twice the size.
If the sizes are the same, they may still contain different bytes. The brute force test would be to load each file into a string and compare them for equality. If they are big files and you don't want to read them all into memory if not necessary, then read them line by line until you encounter a difference. That's assuming they are text files. If they aren't text files, you can do something similar by reading them in fixed size chunks and comparing those.
Another option would be to run the "fc" file compare command on the two files, capture the result, and do your update based on that.
