I have a text file which is a certain number of bytes in size.
To test my file system, I need to programmatically make the file 4096 bytes bigger from a bash script.
I am under the impression that this is doable using the truncate command, but I cannot figure out how - typing truncate myfile.txt -s 4096 leaves me with a 4096-byte file.
For me, something like this works:
truncate -s +4096 myfile.txt
This appends 4096 bytes to the given file. I think you're missing the plus sign.
truncate is useful because, unlike appending bytes, it can also shrink a file. If you only need to grow the file, though, it is as simple as
printf '%4096s' >> myfile.txt
which adds 4096 space characters to the end of the file.
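To confirm the file grew by exactly 4096 bytes, a quick before/after check works; this sketch assumes GNU coreutils stat (on BSD/macOS, use stat -f %z instead of stat -c %s):
before=$(stat -c %s myfile.txt)    # size in bytes before
truncate -s +4096 myfile.txt
after=$(stat -c %s myfile.txt)     # size in bytes after
echo "grew by $((after - before)) bytes"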
I would like to use a Linux shell (bash, zsh, etc.) to insert a set of known bytes into a file at a certain position. Similar questions have been asked, but they modify the bytes of a file in place; they don't address inserting new bytes at particular positions.
For example, if my file has a sequence of bytes like \x32\x33\x35 I might want to insert \x34 at position 2 so that this byte sequence in the file becomes \x32\x33\x34\x35.
You can achieve this using head, tail and printf together. For example, to insert \x34 at position 2 in file:
{ head -c 2 file; printf '\x34'; tail -c +3 file; } > new_file
For POSIX compliance, \064 (the octal representation of \x34) can be used.
To make this change in-place, just move new_file to file.
No matter which tools you use, this operation will be expensive for huge files, since everything from the insertion point onward has to be rewritten.
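If you need this more than once, the same head/printf/tail pipeline can be wrapped in a small shell function; insert_bytes is just an illustrative name, not an existing tool:
insert_bytes() {
    # insert_bytes FILE OFFSET BYTES
    # Writes the first OFFSET bytes of FILE, then the printf-escaped BYTES,
    # then the rest of FILE, into FILE.new, and moves the result over FILE.
    file=$1 offset=$2 bytes=$3
    { head -c "$offset" "$file"; printf "$bytes"; tail -c +"$((offset + 1))" "$file"; } > "$file.new" &&
    mv "$file.new" "$file"
}
# Example: insert \x34 (octal \064) after the first 2 bytes of file
insert_bytes file 2 '\064'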
So I've read through this question on SO but it does not quite help me. I want to import a Gmail-generated mbox file into another webmail service, but the problem is that it only allows files of at most 40 MB per import.
So I somehow have to split the mbox file into files of at most 40 MB and import them one after another. How would you do this?
My initial thought was to use the other script (formail) to save each mail as a single file and afterwards run a script to combine them into 40 MB files, but I still wouldn't know how to do this using the terminal.
I also looked at the split command, but I'm afraid it would cut off mails.
Thanks for any help!
I just improved a script from Mark Setchell's answer. As we can see, that script splits the mbox file based on the number of emails per chunk. This improved script splits the mbox file based on a defined maximum size for each chunk.
So, if you have a size limitation when uploading or importing the mbox file, you can try the script below to split the mbox file into chunks of a specified maximum size.
Save the script below to a text file, e.g. mboxsplit.txt, in the directory that contains the mbox file (e.g. named mbox):
BEGIN { chunk = 0; filesize = 0 }
/^From / {
    if (filesize >= 40000000) {    # maximum size per chunk, in bytes
        close("chunk_" chunk ".txt");
        filesize = 0;
        chunk++;
    }
}
{ filesize += length() }
{ print > ("chunk_" chunk ".txt") }
Then run this line in the directory that contains mboxsplit.txt and the mbox file:
awk -f mboxsplit.txt mbox
Please note:
The size of a resulting chunk may be larger than the defined size; it depends on the size of the last email added to the chunk before the size check.
It will not split an email body.
One chunk may contain only a single email if that email is larger than the specified chunk size.
I suggest specifying a chunk size somewhat smaller than the maximum upload/import size.
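Before importing, it is worth a quick check that the chunks look right, for example:
ls -l chunk_*.txt                # each chunk should be close to (but may slightly exceed) 40000000 bytes
grep -c '^From ' chunk_*.txt     # number of messages per chunk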
If your mbox is in standard format, each message will begin with From and a space:
From someone@somewhere.com
So, you could COPY YOUR MBOX TO A TEMPORARY DIRECTORY and try using awk to process it, on a message-by-message basis, only splitting at the start of any message. Let's say we went for 1,000 messages per output file:
awk 'BEGIN{chunk=0} /^From /{msgs++;if(msgs==1000){msgs=0;chunk++}}{print > ("chunk_" chunk ".txt")}' mbox
You will then get output files called chunk_0.txt, chunk_1.txt, and so on, each containing up to 1,000 messages.
If you are unfortunate enough to be on Windows (which is incapable of understanding single quotes), you will need to save the following in a file called awk.txt
BEGIN{chunk=0} /^From /{msgs++;if(msgs==1000){msgs=0;chunk++}}{print > ("chunk_" chunk ".txt")}
and then type
awk -f awk.txt mbox
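To estimate how many chunk files to expect, you can count the message separators first:
grep -c '^From ' mbox    # total number of messages; divide by 1,000 (rounding up) for the number of chunks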
formail is perfectly suited for this task. You may want to look at formail's +skip and -total options:
Options
...
+skip      Skip the first skip messages while splitting.
-total     Output at most total messages while splitting.
Depending on the size of your mailbox and mails, you may try
formail -100 -s <google.mbox >import-01.mbox
formail +100 -100 -s <google.mbox >import-02.mbox
formail +200 -100 -s <google.mbox >import-03.mbox
etc.
The parts need not be of equal size, of course. If there's one large e-mail, you might end up with only formail +100 -60 -s <google.mbox >import-02.mbox, or, if there are many small messages, maybe formail +100 -500 -s <google.mbox >import-02.mbox.
To find a suitable initial number of mails per chunk, check the output size for a few candidate counts:
formail -100 -s <google.mbox | wc
formail -500 -s <google.mbox | wc
formail -1000 -s <google.mbox | wc
You may need to experiment a bit, in order to accommodate to your mailbox size. On the other hand, since this seems to be a one time task, you may not want to spend too much time on this.
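If a fixed count per chunk turns out to work, the repetitive commands above can be automated with a small loop. This is only an untested sketch, assuming 100 messages per chunk and output names of the form import-NN.mbox:
i=1
skip=0
while :; do
    out=$(printf 'import-%02d.mbox' "$i")
    formail +"$skip" -100 -s < google.mbox > "$out"
    [ -s "$out" ] || { rm -f "$out"; break; }   # stop once a chunk comes out empty
    skip=$((skip + 100))
    i=$((i + 1))
done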
If I understand you correctly, you want to split the file up, then combine the pieces into a big file before importing. That sounds like what split and cat were meant to do. split divides a file according to your size specification, whether in lines or bytes, and adds a suffix to the pieces to keep them in order; you then use cat to put the pieces back together:
$ split -b40m -a5 mbox mbox.    # this makes mbox.aaaaa, mbox.aaaab, etc.
Once you get the files on the other system:
$ cat mbox.* > mbox
You wouldn't do this, however, if you need each file to contain only whole messages, because you are going to import each file into the new mail system one at a time rather than reassemble them first, and split may cut a message in half.
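As an optional sanity check when you do reassemble, you can compare checksums before splitting and after reassembly (assuming the mbox. prefix from the split command above):
cksum mbox                          # on the source system, before splitting
cat mbox.* > mbox && cksum mbox     # on the destination system, after reassembly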
I have 305 files. Each is ~10M lines. I only need to alter the first 20 lines of each file.
Specifically, I need to add # as the first character of the first 18 lines, delete the 19th line (or, more safely, delete all lines that are completely blank), and replace > with # on the 20th line.
The remaining 9.9999999M lines don't need to change at all.
If the files were not gzipped, I could do something like:
while read -r F; do
    for i in $(seq 1 100); do
        awk '{gsub(/#/,"##"); print $0}' "$F"
        # ...more awk commands...
        # ...more awk commands...
    done
done < "$FNAMES"
but what is really throwing a wrench in the works is the fact that the files are all gzipped. Is there any way to efficiently alter these 20 lines without unzipping and/or rewriting the whole file?
No, it is not possible. Adaptive compression schemes (such as the Lempel-Ziv coding gzip uses) adjust the encoding based on what they have seen so far in the file. This means that the way the end of the file gets compressed (and hence decompressed) depends on the beginning of the file. If you change just the beginning of the (compressed) file, you'll change how the end gets decompressed, essentially corrupting the file.
So decompressing, modifying, and recompressing is the only way to do it.
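For completeness, a rough sketch of that round trip for the edits described in the question, assuming GNU sed and a simple *.gz glob (adapt the sed commands and the file list to the real requirements):
for f in *.gz; do
    # decompress, edit the first 20 lines, recompress; the rest of the
    # stream is passed through (and rewritten) untouched
    gunzip -c "$f" \
      | sed -e '1,18s/^/#/' -e '1,20{/^$/d;}' -e '20s/>/#/' \
      | gzip > "$f.tmp" && mv "$f.tmp" "$f"
done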
I am running a program (pianobar) piped to a text file; it writes output every second. The resulting file ("pianobarout.txt") needs to be cleared regularly, or it grows to massive proportions. However, I do not want to stop pianobar to clear the file.
I have tried running > pianobarout.txt as well as echo "" > pianobarout.txt, but both cause the system's resources to spike heavily for almost 30 seconds, causing the audio from pianobar to skip. I tried removing the file, but it appears that the file is not recreated after being deleted, and I just lose the pipe.
I'm working from python, so if any library there can help, those are available to me.
Any ideas?
If you are currently redirecting with truncation, like yourprogram > file.txt, try redirecting with appending: yourprogram >> file.txt.
There is a big difference between the two when the output file is truncated.
With appending redirection, data is written to the current end of the file. If you truncate it to 0 bytes, the next write will happen at position 0.
With truncating redirection, data is written at whatever offset the writer has reached in the file. If you truncate the file to 0 bytes, writes will still continue at that offset, e.g. at byte 1073741824 if 1 GB had already been written.
This results in a sparse file if the filesystem supports it (ext2-4 and most Unix filesystems do), or a long wait while the file is written back out if it doesn't (like FAT32). A long wait could also be caused by anything following the file, such as tail -f, which potentially has to catch up by reading a gigabyte of zeroes.
Alternatives include yourprogram | split -b 1G - output-, which will write 1 GB at a time to output-aa, output-ab, etc., letting you delete old files at your leisure.
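For the append-and-truncate approach, a minimal sketch looks like this (pianobar is interactive, so run it however you already do, just with >> instead of >; from Python, open('pianobarout.txt', 'w').close() performs the same truncation):
pianobar >> pianobarout.txt      # append-mode redirection
# ...later, whenever the log has grown too large:
: > pianobarout.txt              # truncate; the next append lands at offset 0, no sparse file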
I'm on a shared server with restricted disk space and I've got a gz file that expands into a HUGE file, more than I've got room for. How can I extract it "portion" by "portion" (let's say 10 MB at a time), and process each portion, without extracting the whole thing even temporarily?
No, this is just ONE super huge compressed file, not a set of files, please...
Hi David, your solution looks quite elegant, but if I'm reading it right, it seems like every time, gunzip extracts from the beginning of the file (and the output of that is thrown away). I'm sure that'll be causing a huge strain on the shared server I'm on (I don't think it's "reading ahead" at all) - do you have any insights on how I can make gunzip "skip" the necessary number of blocks?
If you're doing this with (Unix/Linux) shell tools, you can use gunzip -c to uncompress to stdout, then use dd with the skip and count options to copy only one chunk.
For example (with GNU dd, iflag=fullblock ensures reads from the pipe are not cut short):
gunzip -c input.gz | dd bs=10485760 skip=0 count=1 iflag=fullblock > output
then skip=1, skip=2, etc.
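To walk through the whole file this way, the command can be wrapped in a loop. A rough sketch, where process_chunk is just a placeholder for whatever per-chunk processing is needed; note that gunzip still decompresses from the start of the archive for every chunk, dd merely discards the earlier output:
chunk=0
while :; do
    gunzip -c input.gz | dd bs=10485760 skip="$chunk" count=1 iflag=fullblock 2>/dev/null > part
    [ -s part ] || break          # stop once a chunk comes out empty
    process_chunk part            # hypothetical per-chunk processing step
    chunk=$((chunk + 1))
done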
Unfortunately I don't know of an existing Unix command that does exactly what you need. You could do it easily with a little program in any language, e.g. in Python, cutter.py (any language would do just as well, of course):
import sys
try:
    size = int(sys.argv[1])
    N = int(sys.argv[2])
except (IndexError, ValueError):
    print("Use: %s size N" % sys.argv[0], file=sys.stderr)
    sys.exit(2)
# stdin is a pipe, so it cannot be seeked; skip the first (N-1)*size bytes
# by reading and discarding them, then copy the next chunk to stdout.
to_skip = (N - 1) * size
while to_skip > 0:
    discarded = sys.stdin.buffer.read(min(to_skip, 1 << 20))
    if not discarded:
        break
    to_skip -= len(discarded)
sys.stdout.buffer.write(sys.stdin.buffer.read(size))
Now gunzip <huge.gz | python3 cutter.py 1000000 5 > fifthone will put exactly a million bytes into the file fifthone, skipping the first 4 million bytes of the uncompressed stream.