GUNZIP / Extract file "portion by portion" - shell

I'm on a shared server with restricted disk space and I've got a .gz file that expands into a HUGE file, more than what I've got space for. How can I extract it "portion" by "portion" (let's say 10 MB at a time), and process each portion, without extracting the whole thing even temporarily?
No, this is just ONE super huge compressed file, not a set of files please...
Hi David, your solution looks quite elegant, but if I'm reading it right, it seems like gunzip extracts from the beginning of the file every time (and the output of that is thrown away). I'm sure that will put a huge strain on the shared server I'm on (I don't think it's "reading ahead" at all). Do you have any insights on how I can make gunzip "skip" the necessary number of blocks?

If you're doing this with (Unix/Linux) shell tools, you can use gunzip -c to uncompress to stdout, then use dd with the skip and count options to copy only one chunk.
For example:
gunzip -c input.gz | dd bs=10485760 skip=0 count=1 >output
then skip=1, skip=2, etc.
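Wrapped in a loop, a minimal sketch might look like this (assuming GNU dd, where iflag=fullblock avoids short reads from the pipe, and a hypothetical process_chunk command standing in for whatever handles each piece; note that, as pointed out above, every pass decompresses the stream from the beginning again):
i=0
while : ; do
    gunzip -c input.gz | dd bs=10485760 skip="$i" count=1 iflag=fullblock 2>/dev/null > chunk
    [ -s chunk ] || break          # empty output: past the end of the uncompressed stream
    process_chunk chunk            # hypothetical: do whatever you need with this 10 MB piece
    rm -f chunk
    i=$((i + 1))
done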

Unfortunately I don't know of an existing Unix command that does exactly what you need. You could do it easily with a little program in any language, e.g. in Python, cutter.py (any language would do just as well, of course):
import sys
try:
    size = int(sys.argv[1])
    N = int(sys.argv[2])
except (IndexError, ValueError):
    print("Use: %s size N" % sys.argv[0], file=sys.stderr)
    sys.exit(2)
# stdin is a pipe here, so it cannot seek(); skip by reading and discarding
to_skip = (N - 1) * size
while to_skip > 0:
    chunk = sys.stdin.buffer.read(min(to_skip, 1 << 20))
    if not chunk:
        sys.exit(1)  # stream ended before reaching the requested portion
    to_skip -= len(chunk)
sys.stdout.buffer.write(sys.stdin.buffer.read(size))
Now gunzip <huge.gz | python3 cutter.py 1000000 5 > fifthone will put exactly a million bytes into the file fifthone, skipping the first 4 million bytes of the uncompressed stream.

Related

Bash script that creates files of a set size

I'm trying to set up a script that will create .txt files of 24 MB each in the /tmp/ directory. The idea behind this script is that Zabbix, a monitoring service, will notice that the directory is full and wipe it completely using a recovery expression.
However, I'm new to Linux and seem to be stuck on the script that generates the files. This is what I've currently written out.
today="$( date +"%Y%m%d" )"
number=0
while test -e "$today$suffix.txt"; do
    (( ++number ))
    suffix="$( printf -- %02d "$number" )"
done
fname="$today$suffix.txt"
printf 'Will use "%s" as filename\n' "$fname"
printf -c 24m /tmp/testf > "$fname"
I think what I'm doing wrong has to do with the printf command, but any input, advice and/or pointers to a guide on scripting are very welcome.
Many thanks,
Melanchole
I guess that it doesn't matter what bytes are actually in that file, as long as it fills up the temp dir. For that reason, the right tool to create the file is dd, which is available in every Linux distribution, often installed by default.
Check the manpage for the different options, but the most important ones are:
if: the input file; /dev/zero is the usual choice, as it is just an endless stream of zero bytes
of: the output file; you can keep the code you already have to generate the name
count: the number of blocks to copy; use 24 here
bs: the size of each block; use 1M (one megabyte) for that
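Putting those options together, a minimal sketch (assuming GNU dd as found on Linux, and reusing the $fname variable from your script) would be:
dd if=/dev/zero of="$fname" bs=1M count=24   # 24 blocks of 1 MiB of zero bytes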

How to piece together a split file when there is not enough disk space

I was sent 194 GB worth of data split into 1984 files.
There are only 37 GB left on my disk and there are no other disks with that amount of free space. Obviously, this is not going to work:
cat file.tar.gz.part* > file.tar.gz
I'm looking for a way to incrementally piece this huge file together.
I might end up writing the script myself, but I'm posting here for the community.
We need to assume the large file was split using a naming convention
Original file=LargeFile.bin
Split files=(LargeFile.split.aaa, LargeFile.split.aab, ...)
The script to recover would then be:
outfile=LargeFile.recovered.bin
for i in LargeFile.split.* ; do
    cat "${i}" >> "${outfile}"
    rm -f "${i}"
done
Simple but handy when there is not enough space to do it in one move
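Applied to the part names from the question (file.tar.gz.part*), with quoting added, the same idea might look like:
outfile=file.tar.gz
for i in file.tar.gz.part* ; do
    cat "${i}" >> "${outfile}"   # append this part to the archive being rebuilt
    rm -f "${i}"                 # then delete it to free its space immediately
done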

How to split an mbox file into n-MB big chunks using the terminal?

So I've read through this question on SO but it does not quite help me. I want to import a Gmail-generated mbox file into another webmail service, but the problem is that it only allows files of at most 40 MB per import.
So I somehow have to split the mbox file into files of at most 40 MB each and import them one after another. How would you do this?
My initial thought was to use the other script (formail) to save each mail as a single file and afterwards run a script to combine them into 40 MB files, but I still wouldn't know how to do this in the terminal.
I also looked at the split command, but I'm afraid it would cut off mails.
Thanks for any help!
I just improved a script from Mark Sechell's answer. As we can see, that script splits the mbox file based on the number of emails per chunk. This improved script splits the mbox file based on a defined maximum size for each chunk.
So, if you have a size limit when uploading or importing the mbox file, you can try the script below to split the mbox file into chunks of the specified size.
Save the script below to a text file, e.g. mboxsplit.txt, in the directory that contains the mbox file (e.g. named mbox):
BEGIN { chunk = 0; filesize = 0; }
/^From / {
    if (filesize >= 40000000) {  # maximum file size per chunk, in bytes
        close("chunk_" chunk ".txt");
        filesize = 0;
        chunk++;
    }
}
{ filesize += length() }
{ print > ("chunk_" chunk ".txt") }
And then run/type this line in that directory (contains the mboxsplit.txt and the mbox file):
awk -f mboxsplit.txt mbox
Please note:
The size of a chunk may end up larger than the defined size; it depends on the size of the last email appended to the chunk before the size check.
It will not split an email body.
A chunk may contain only a single email if that email is larger than the specified chunk size.
I therefore suggest specifying a chunk size somewhat lower than the maximum upload/import size.
If your mbox is in standard format, each message will begin with From and a space:
From someone@somewhere.com
So, you could COPY YOUR MBOX TO A TEMPORARY DIRECTORY and try using awk to process it, on a message-by-message basis, only splitting at the start of any message. Let's say we went for 1,000 messages per output file:
awk 'BEGIN{chunk=0} /^From /{msgs++;if(msgs==1000){msgs=0;chunk++}}{print > "chunk_" chunk ".txt"}' mbox
then you will get output files called chunk_0.txt, chunk_1.txt, and so on, each containing up to 1,000 messages.
If you are unfortunate enough to be on Windows (which is incapable of understanding single quotes), you will need to save the following in a file called awk.txt
BEGIN{chunk=0} /^From /{msgs++;if(msgs==1000){msgs=0;chunk++}}{print > "chunk_" chunk ".txt"}
and then type
awk -f awk.txt mbox
formail is perfectly suited for this task. You may look at formail's +skip and -total options
Options
...
+skip
Skip the first skip messages while splitting.
-total
Output at most total messages while splitting.
Depending on the size of your mailbox and mails, you may try
formail -100 -s <google.mbox >import-01.mbox
formail +100 -100 -s <google.mbox >import-02.mbox
formail +200 -100 -s <google.mbox >import-03.mbox
etc.
The parts need not be of equal size, of course. If there's one large e-mail, you may have only formail +100 -60 -s <google.mbox >import-02.mbox, or if there are many small messages, maybe formail +100 -500 -s <google.mbox >import-02.mbox.
To find a suitable initial number of mails per chunk, try
formail -100 -s <google.mbox | wc
formail -500 -s <google.mbox | wc
formail -1000 -s <google.mbox | wc
You may need to experiment a bit, in order to accommodate to your mailbox size. On the other hand, since this seems to be a one time task, you may not want to spend too much time on this.
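If you do settle on a fixed number of messages per chunk, a rough sketch of a loop that generates these commands (assuming an empty output file signals that no messages are left) could be:
skip=0
part=1
while : ; do
    out="import-$(printf %02d "$part").mbox"
    if [ "$skip" -eq 0 ]; then
        formail -100 -s < google.mbox > "$out"
    else
        formail +"$skip" -100 -s < google.mbox > "$out"
    fi
    [ -s "$out" ] || { rm -f "$out"; break; }   # empty output: nothing left to split
    skip=$((skip + 100))
    part=$((part + 1))
done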
My initial thought was to use the other script (formail) to save each mail as a single file and afterwards run a script to combine them into 40 MB files, but I still wouldn't know how to do this in the terminal.
If I understand you correctly, you want to split the file up, then combine the pieces into one big file again before importing it. That sounds like what split and cat were meant to do. split divides a file according to your size specification, in lines or in bytes, and adds a suffix to the resulting files to keep them in order. You then use cat to put the files back together:
$ split -b40m -a5 mbox mbox.   # this makes mbox.aaaaa, mbox.aaaab, etc.
Once you get the files on the other system:
$ cat mbox.* > mbox
You wouldn't do this if you need the pieces to respect message boundaries, i.e. if you are going to import each piece into the new mail system one at a time rather than reassembling them first.

Smart split file with gzipping each part?

I have a very long file with numbers. Something like the output of this perl program:
perl -le 'print int(rand() * 1000000) for 1..10'
but way longer, around hundreds of gigabytes.
I need to split this file into many others. For test purposes, let's assume 100 files, and that the output file number is obtained by taking the number modulo 100.
With normal files, I can do it simply with:
perl -le 'print int(rand() * 1000000) for 1..1000' | awk '{z=$1%100; print > z}'
But I have a problem when I need to compress the split parts. Normally, I could:
... | awk '{z=$1%100; print | "gzip -c - > "z".txt.gz"}'
But when ulimit is configured to allow fewer open files than the number of "partitions", awk breaks with:
awk: (FILENAME=- FNR=30) fatal: can't open pipe `gzip -c - > 60.txt.gz' for output (Too many open files)
This doesn't break with normal file output, as GNU awk is apparently smart enough to recycle file handles.
Do you know any way (aside from writing my own stream-splitting program, implementing buffering and some sort of pool-of-filehandles management) to handle such a case - that is: splitting into multiple files, where access to the output files is random, and gzipping all output partitions on the fly?
I didn't write it in the question itself, but since the additional information belongs together with the solution, I'll write it all here.
So - the problem was on Solaris. Apparently there is a limitation that no program using stdio on Solaris can have more than 256 open file handles?!
It is described here in detail. The important point is that it's enough to set one environment variable before running my problematic program, and the problem is gone:
export LD_PRELOAD_32=/usr/lib/extendedFILE.so.1
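If raising the stdio limit is not an option, a rough workaround sketch is to close each gzip pipe right after writing to it and reopen it in append mode on the next record; concatenated gzip members still form a valid gzip file, although spawning a gzip process per record makes this noticeably slower:
... | awk '{z=$1%100; cmd="gzip -c >> " z ".txt.gz"; print | cmd; close(cmd)}'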

How do I split a large file in unix repeatedly?

I have a situation with a failing LaCie 500GB hard drive. It stays on for only about 10 minutes, then becomes unusable. For those 10 minutes or so I do have complete control.
I can't get my main mov file (160 GB) transferred off that quickly, so I was thinking that if I split it into small chunks, I could move them all off. I tried splitting the movie file using the split command, but of course it took longer than 10 minutes. I ended up with about 14 GB of files, 2 GB each, before it failed.
Is there a way I can use a split command and skip any existing file chunks, so that as I'm splitting this file it will see xaa, xab, xac and start after that point, continuing to split the file starting with xad?
Or is there a better option that can split a file in multiple stages? I looked at csplit as well, but that didn't seem like an option either.
Thanks!
-------- UPDATE ------------
Now with the help of bcat and Mark I was able to do this using the following
dd if=/Volumes/badharddrive/file.mov of=/Volumes/mainharddrive/movieparts/moviepart1 bs=1g count=4
dd if=/Volumes/badharddrive/file.mov of=/Volumes/mainharddrive/movieparts/moviepart2 bs=1g count=4 skip=4
dd if=/Volumes/badharddrive/file.mov of=/Volumes/mainharddrive/movieparts/moviepart3 bs=1g count=4 skip=8
etc
cat /Volumes/mainharddrive/movieparts/moviepart[1-3] > newmovie.mov
You can always use the dd command to copy chunks of the old file into a new location. This has the added benefit of not doing unnecessary writes to the failing drive. Using dd like this could get tedious with such a large mov file, but you should be able to write a simple shell script to automate part of the process.
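For example, a sketch of such a script, assuming the same source and destination paths as in the update above (the 40-part count is an assumption covering a 160 GB file at 4 GB per part):
part=1
while [ "$part" -le 40 ]; do
    out="/Volumes/mainharddrive/movieparts/moviepart$part"
    if [ ! -e "$out" ]; then       # skip chunks already copied on an earlier attempt
        dd if=/Volumes/badharddrive/file.mov of="$out" bs=1g count=4 skip=$(( (part - 1) * 4 ))
    fi
    part=$((part + 1))
done
If a run is interrupted in the middle of a part, delete that partial file before the next attempt so it gets copied again.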
Duh! bcat's answer is way better than mine, but since I wrote some code I figured I'd go ahead and post it.
# Copy `length` bytes starting at `offset` out of the input file into a new
# file named <input>-<offset>-<length>.
input  = ARGV[0]
length = ARGV[1].to_i
offset = ARGV[2].to_i
File.open "#{input}-#{offset}-#{length}", 'wb' do |file|
  file.write(File.read(input, length, offset))
end
Use it like this:
$ ruby test.rb input_file length offset
