When would you use zipsplit vs split from a Linux/Unix command line?
I had to break up a large zip file from the command line to keep ftp from timing out (yes, I was using passive mode), and someone suggested zipsplit. That command wouldn't break the archive into pieces smaller than its largest member, and the pieces didn't rejoin properly when I tried to reassemble them on the other side. The split command let me break any file into pieces of an arbitrary size and rejoin them by piping through a simple cat command.
I was frustrated, and I just don't understand why zipsplit exists when split is available.
zipsplit splits a zip file into smaller, valid, readable, decompressible zip files. With split, you can't do anything useful with the pieces until you join them back together.
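For example (a sketch; the archive name and sizes are made up), zipsplit hands you independently usable archives, while split only gives you raw chunks:
# each piece is itself a valid zip archive no larger than about 5 MB (-n takes a size in bytes)
zipsplit -n 5000000 archive.zip
# split produces raw 5 MB chunks that are useless on their own ...
split -b 5M archive.zip archive.part_
# ... until they are concatenated back together, in order, on the other side
cat archive.part_* > reassembled.zip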
Related
I posted the following code and got scolded. Why is this not acceptable?
numberOfLines=$(wc -l <"$1")
for ((i=1; i<=numberOfLines; ++i)); do
lineN=$(sed -n "$i!d;p;q" "$1")
# ... do things with "$lineN"
done
We collect the number of lines in the input file into numberOfLines, then loop from 1 to that number, pulling out the next line from the file with sed in each iteration.
The feedback I received complained that reading the same file repeatedly with sed inside the loop to get the next line is inefficient. I guess I could use head -n "$i" "$1" | tail -n 1 but that's hardly more efficient, is it?
Is there a better way to do this? Why would I want to avoid this particular approach?
The shell (and basically every programming language above assembly language) already knows how to loop over the lines in a file; it does not need to know how many lines there will be in order to fetch the next one. Notice that in your example, sed already does this, so if the shell couldn't do it, you could loop over the output from sed instead.
The proper way to loop over the lines in a file in the shell is with while read. There are a couple of complications — commonly, you reset IFS to avoid having the shell needlessly split the input into tokens, and you use read -r to avoid some pesky legacy behavior with backslashes in the original Bourne shell's implementation of read, which has been retained for backward compatibility.
while IFS='' read -r lineN; do
# do things with "$lineN"
done <"$1"
Besides being much simpler than your sed script, this avoids the problem that you read the entire file once to obtain the line count, then read the same file again and again in each loop iteration. With a typical modern OS, some repeated reading will be avoided thanks to caching (the disk driver keeps a buffer of recently accessed data in memory, so that reading it again will not actually require fetching it from the disk again), but the basic fact is still that reading information from disk is on the order of 1000x slower than not doing it when you can avoid it. Especially with a large file, the cache will fill up eventually, and so you end up reading in and discarding the same bytes over and over, adding significant CPU overhead and an even more significant amount of time simply waiting for the disk to deliver the same bytes, again and again.
In a shell script, you also want to avoid the overhead of an external process if you can. Invoking sed (or the functionally equivalent but even more expensive two-process head -n "$i" | tail -n 1) thousands of times in a tight loop will add significant overhead for any non-trivial input file. On the other hand, if the body of your loop could be done in e.g. sed or Awk instead, that's going to be a lot more efficient than a native shell while read loop, because of the way read is implemented. This is why while read is also frequently regarded as an antipattern.
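For instance (a hypothetical loop body), if all the loop did was prefix each line with its line number, a single Awk process could replace both the while read loop and the thousands of sed invocations:
# one process, one pass over the file; NR is the current line number
awk '{ printf "%d: %s\n", NR, $0 }' "$1"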
And make sure you are reasonably familiar with the standard palette of Unix text processing tools such as cut, paste, nl, pr, etc.
In many, many cases you should avoid looping over the lines in a shell script and use an external tool instead. There is basically only one exception to this: when the body of the loop itself mainly uses built-in shell commands.
The q in the sed script is only a partial remedy for repeatedly reading the input file, and you frequently see variations where the sed script reads the entire input file through to the end each time, even when it only wants to fetch one of the very first lines.
With a small input file, the effects are negligible, but perpetuating this bad practice just because it's not immediately harmful when the input file is small is simply irresponsible. Just don't teach this technique to beginners. At all.
If you really need to display the number of lines in the input file, for a progress indicator or similar, at least make sure you don't spend a lot of time seeking through to the end just to obtain that number. Maybe stat the file and keep track of how many bytes there are on each line, so you can project the number of lines you have left (and instead of line 1/10345234 display something like line 1/approximately 10000000?) ... or use an external tool like pv.
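A minimal sketch with pv, if it is installed: in byte mode it can show a percentage without scanning the file for a line count first (note that, because of the pipe, the loop body runs in a subshell here):
pv "$1" | while IFS='' read -r lineN; do
    # do things with "$lineN"
    :
done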
Tangentially, there is a vaguely related antipattern you want to avoid, too; you don't want to read an entire file into memory when you are only going to process one line at a time. Doing that in a for loop also has some additional gotchas, so don't do that, either; see https://mywiki.wooledge.org/DontReadLinesWithFor
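For reference, this is the pattern that page warns against: the command substitution slurps the whole file into memory, and word splitting and glob expansion then mangle the "lines":
# don't do this
for line in $(cat "$1"); do
    printf '%s\n' "$line"
done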
Another common variation is to find the line you want to modify with grep, only to then hand it to sed ... which already knows full well how to perform a regex search by itself. (See also useless use of grep.)
# XXX FIXME: wrong
line=$(grep "foo" file)
sed -i "s/$line/thing/" file
The correct way to do this would be to simply change the sed script to contain a search condition:
sed -i '/foo/s/.*/thing/' file
This also avoids the complications when the value of $line in the original, faulty script contains something which needs to be escaped in order to actually match itself. (For example, foo\bar* in a regular expression does not match the literal text itself.)
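For instance (a hypothetical matched line), a slash in the captured text breaks the generated s command outright, before you even get to the metacharacter problem:
line='foo/bar'                  # whatever grep happened to find
sed -i "s/$line/thing/" file    # becomes s/foo/bar/thing/ and sed reports: unknown option to `s'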
There are two similar function blocks for reading files in Beckhoff's TwinCAT software: FB_FileGets and FB_FileRead. I would appreciate it if someone could explain the differences between them and clarify when each one should be used. Do they have the same prerequisites, and are they used the same way in programs? Which one is faster for reading different file formats? Any information that makes them clearer for better programming would help.
FB_FileGets reads the file line by line, so when you call it, you always get one line of the text file as a string. The maximum length of a line is 255 characters. So by using this function block, it's very easy to read all lines of a file: no need for buffers and memory copying, as long as the 255-character line length limit is OK.
FB_FileRead reads a given number of bytes from the file, so you can read files with, for example, 65000 characters on a single line.
I would use FB_FileGets in all cases where you know that the lines are less than 255 characters and you handle the data line by line. It's very simple to use. If you have no idea of the line sizes, if you need all the data at once, or if the file is very big, I would use FB_FileRead.
I haven't tested it, but I think FB_FileRead is probably faster, as it just copies the bytes into a buffer, and you can read the whole file at once rather than line by line.
I have a very large (multiple gigabytes) file that I want to do simple operations on:
Add 5-10 lines in the end of the file.
Add 2-3 lines in the beginning of the file.
Delete a few lines in the beginning, up to a certain substring. Specifically, I need to traverse the file up to a line that says "delete me!\n" and then delete all lines in the file up to and including that line.
I'm struggling to find a tool that can do the editing in place, without creating a temporary file (very long task) that has essentially a copy of my original file. Basically, I want to minimize the number of I/O operations against the disk.
Both sed -i and awk -i do exactly that slow thing (https://askubuntu.com/questions/20414/find-and-replace-text-within-a-file-using-commands) and are inefficient as a result. What's a better way?
I'm on Debian.
Adding even 2-3 lines at the beginning of a multi-GB file will always require fully rewriting the contents of that file, unless you're using an OS and filesystem that provides nonstandard syscalls. (You can avoid needing multiple GB of temporary space by writing back to a point in the file you're modifying from which you've already read to a buffer, but you can't avoid needing to rewrite everything past the point of the edit.)
This is because UNIX only permits a file's overall size to change by adding or removing content at or past its existing end. You can edit part of a file in place -- that is to say, you can seek 1 GB in and write 1 MB of new contents -- but this replaces the 1 MB of contents that had previously been in that location; it doesn't change the total size of the file. Similarly, you can truncate a file at a location of your choice and rewrite from there, but everything past the point of truncation needs to be rewritten.
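As a sketch of that in-place overwrite (bigfile and the offset are made up), a single write can replace bytes in the middle of the file cheaply, but there is no equivalent call that makes room for new bytes:
# overwrite 5 bytes starting 1 GiB into the file, without truncating or copying anything else
printf 'XXXXX' | dd of=bigfile bs=1M seek=1024 conv=notrunc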
Examples of the nonstandard operations referred to above are the FALLOC_FL_INSERT_RANGE and FALLOC_FL_COLLAPSE_RANGE fallocate operations, which on fairly new Linux kernels allow blocks to be inserted into or removed from an existing file. This is unlikely to be helpful to you here:
Only whole blocks (i.e. 4 KB, or whatever your filesystem is formatted for) can be inserted or removed, not individual lines of text of arbitrary size.
Only XFS and ext4 are supported.
See the documentation for fallocate(2).
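For what it's worth, the same operations are exposed by the util-linux fallocate(1) command line tool; a sketch, assuming an ext4 or XFS filesystem with 4096-byte blocks and a file named bigfile:
# remove the first 4 KiB of the file in place, shrinking it by one block
fallocate --collapse-range --offset 0 --length 4096 bigfile
# insert one 4 KiB block of zeroes at the start (you would still have to write real data into it)
fallocate --insert-range --offset 0 --length 4096 bigfile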
Here is a recommendation for editing large files (adjust the line count and the number of suffix digits based on your file length and the number of sections you want to work on):
split -l 1000 -a 4 -d bigfile bigfile_
For that you need extra disk space, since bigfile won't be removed.
Insert a header as the first line:
sed -i '1i header' bigfile_0000
Search for a specific pattern, get the matching file name, and remove the previous sections:
grep -l pattern bigfile_*
etc.
Once all the editing is done, just cat the remaining pieces back together:
cat bigfile_* > edited_bigfile
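As a sketch of the "delete everything up to a marker" step on those pieces (assuming GNU sed and that the marker line "delete me!" occurs only once):
# find the piece that contains the marker line
marker_piece=$(grep -l 'delete me!' bigfile_* | head -n 1)
# remove every piece that comes before it (the numeric suffixes sort in order)
for f in bigfile_*; do
    [ "$f" = "$marker_piece" ] && break
    rm -- "$f"
done
# trim everything up to and including the marker inside that piece (GNU sed 0,/regex/ address)
sed -i '0,/delete me!/d' "$marker_piece"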
I have a HUGE file of 10G. I want to remove line 188888 from this file.
I use sed as follows:
sed -i '188888d' file
The problem is that it is really slow. I understand it is because of the size of the file, but is there any way I can do that faster?
Thanks
Try
sed -i '188888{;d;q;}' file
You may need to experiment with which of the above semicolons you keep; {d;q} ... would be the second thing to try.
This will stop searching the file after it deletes that one line, but you'll still have to spend the time re-writing the file. It would also be worth testing
sed '188888{;q;d;}' file > /path/to/alternate/mountpoint/newFile
where the alternate mountpoint is on a separate disk drive.
Ah, one other option would be to edit the data while it is being written, through a pipe:
yourLogFileProducingProgram | sed '188888d' > logFile
But this assumes that you know the data you want to delete is always at line 188888; is that possible?
I hope this helps.
File lines are delimited by the \n character, so if line sizes are variable you cannot calculate the byte offset of a given line directly; you have to count newlines up to it.
This will always be O(n) where n is the number of bytes in the file.
Parallel algorithms do not help either, because this operation is disk-I/O-bound; divide and conquer will be even slower.
If you will do this a lot on the same file, there are ways to preprocess the file and make it faster.
An easy way is to build an index with entries of the form
line#:offset
When you want to find a line, do a binary search (O(log n)) in the index for the line number you want, and use its offset to locate the line in the original file.
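A minimal sketch of that idea in the shell (assuming GNU awk, single-byte characters, and a file simply called file); a linear scan of the small index stands in for the binary search here:
# build the index once: each line's number and the byte offset at which it starts
awk 'BEGIN { off = 0 } { print NR ":" off; off += length($0) + 1 }' file > file.idx
# later, jump straight to line 188888 without rescanning the 10G file
off=$(awk -F: '$1 == 188888 { print $2; exit }' file.idx)
tail -c +"$((off + 1))" file | head -n 1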
I'm working on Windows XP and I need to make a script that would compare 2 files (1 on a server and 1 on a client). Basically, I need my script to check if the file from the client is different from the server version and replace the client version if it finds a difference (in the file itself, not only the modification date).
As you suggest, you can skip the date check as that can be changed without the contents changing.
First check whether the sizes are different. If so, that may be enough to conclude that they are different. This can have false positives too, though, depending on the types of files. For example, a Unicode text file may contain exactly the same content as an ANSI text file but be encoded with two bytes per character. If it's a script, it would execute with exactly the same results, but be twice the size.
If the sizes are the same, they may still contain different bytes. The brute-force test would be to load each file into a string and compare them for equality. If they are big files and you don't want to read them entirely into memory unless necessary, read them line by line until you encounter a difference. That assumes they are text files. If they aren't text files, you can do something similar by reading them in fixed-size chunks and comparing those.
Another option would be to run the "fc" file compare command on the two files, capture the result, and do your update based on that.