IO::Uncompress::Gunzip stops after first "original" gzipped file inside "concatenated" gzipped file - bash

In bash, you can concatenate gzipped files and the result is a valid gzipped file. As far as I recall, I have always been able to treat these "concatenated" gzipped files as normal gzipped files. For example:
echo 'Hello world!' > hello.txt
echo 'Howdy world!' > howdy.txt
gzip hello.txt
gzip howdy.txt
cat hello.txt.gz howdy.txt.gz > greetings.txt.gz
gunzip greetings.txt.gz
cat greetings.txt
Which outputs
Hello world!
Howdy world!
However, when trying to read this same file using Perl's core IO::Uncompress::Gunzip module, it doesn't get past the first original file. Here is the result:
./my_zcat greetings.txt.gz
Hello world!
Here is the code for my_zcat:
#!/bin/env perl
use strict;
use warnings;
use v5.10;
use IO::Uncompress::Gunzip qw($GunzipError);
my $file_name = shift;
my $fh = IO::Uncompress::Gunzip->new($file_name) or die $GunzipError;
while (defined(my $line = readline $fh))
{
print $line;
}
If I totally decompress the files before creating a new gzipped file, I don't have this problem:
zcat hello.txt.gz howdy.txt.gz | gzip > greetings_via_zcat.txt.gz
./my_zcat greetings_via_zcat.txt.gz
Hello world!
Howdy world!
So, what is the difference between greetings.txt.gz and greetings_via_zcat.txt.gz, and why does IO::Uncompress::Gunzip handle greetings_via_zcat.txt.gz correctly but not greetings.txt.gz?
Based on this answer to another question, I'm guessing that IO::Uncompress::Gunzip messes up because of the metadata between the files. But since greetings.txt.gz is a valid gzip file, I would expect IO::Uncompress::Gunzip to work.
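A rough way to see the structural difference (sketched for a Unix-like system; it simply counts occurrences of the gzip magic bytes 1f 8b 08 in each file) is that the concatenated file should report two member headers while the re-compressed one should report just one:
perl -0777 -ne 'my $n = () = /\x1f\x8b\x08/g; print "$ARGV: $n gzip member header(s)\n"' greetings.txt.gz greetings_via_zcat.txt.gz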
My workaround for now will be piping from zcat (which of course doesn't help Windows users much):
#!/bin/env perl
use strict;
use warnings;
use v5.10;
my $file_name = shift;
open(my $fh, '-|', "zcat $file_name");
while (defined(my $line = readline $fh))
{
print $line;
}

This is covered explicitly in the IO::Compress FAQ section Dealing with concatenated gzip files. Basically you just have to include the MultiStream option when you construct the IO::Uncompress::Gunzip object.
Here is a definition of the MultiStream option:
MultiStream => 0|1
If the input file/buffer contains multiple
compressed data streams, this option will uncompress the whole lot as
a single data stream.
Defaults to 0.
So your code needs this change:
my $fh = IO::Uncompress::Gunzip->new($file_name, MultiStream => 1) or die $GunzipError;
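For completeness, here is the full my_zcat from the question with that single change applied (a minimal sketch; everything else is unchanged):
#!/bin/env perl
use strict;
use warnings;
use v5.10;
use IO::Uncompress::Gunzip qw($GunzipError);

my $file_name = shift;

# MultiStream => 1 makes Gunzip keep reading past the first gzip member
my $fh = IO::Uncompress::Gunzip->new($file_name, MultiStream => 1)
    or die $GunzipError;

while (defined(my $line = readline $fh))
{
    print $line;
}
With that change, ./my_zcat greetings.txt.gz should print both Hello world! and Howdy world!.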

Related

Splitting very long (4GB) string with new lines

I have a file that is supposed to be JSON objects, one per line. Unfortunately, a miscommunication happened with the creation of the file, and the JSON objects only have a space between them, not a new-line.
I need to fix this by replacing every instance of } { with }\n{.
Should be easy for sed or Perl, right?
sed -e "s/}\s{/}\n{/g" file.in > file.out
perl -pe "s/}\s{/}\n{/g" file.in > file.out
But file.in is actually 4.4 GB which seems to be causing a problem for both of these solutions.
The sed command finishes with a halfway-correct file, but file.out is only 335 MB and contains only about the first 1/10th of the input file, cutting off in the middle of a line. It's almost as if sed just quit in the middle of the stream; maybe it's trying to load the entire 4.4 GB file into memory, running out of space at around 300 MB, and silently killing itself.
The Perl command errors with the following message:
[1] 2904 segmentation fault perl -pe "s/}\s{/}\n{/g" file.in > file.out
What else should I try?
Unlike the earlier solutions, this one handles {"x":"} {"}.
use strict;
use warnings;
use feature qw( say );
use JSON::XS qw( );
use constant READ_SIZE => 64*1024*1024;
my $j_in = JSON::XS->new->utf8;
my $j_out = JSON::XS->new;
binmode STDIN;
binmode STDOUT, ':encoding(UTF-8)';
while (1) {
    # Read a large block of raw bytes from STDIN.
    my $rv = sysread(\*STDIN, my $block, READ_SIZE);
    die($!) if !defined($rv);
    last if !$rv;

    # Feed the block to the incremental parser and emit any complete objects.
    $j_in->incr_parse($block);
    while (my $o = $j_in->incr_parse()) {
        say $j_out->encode($o);
    }
}
die("Bad data") if $j_in->incr_text !~ /^\s*\z/;
The default input record separator in Perl is \n, but you can change it to any character you want. For this problem, you could use { (octal 173).
perl -0173 -pe 's/}\s{/}\n{/g' file.in > file.out
perl -ple 'BEGIN{$/=qq/} {/;$\=qq/}\n{/}undef$\ if eof' <input >output
Assuming your input doesn't contain } { pairs in other contexts that you do not want replaced, all you need is:
awk -v RS='} {' '{ORS=(RT ? "}\n{" : "\n")} 1'
e.g.
$ printf '{foo} {bar}' | awk -v RS='} {' '{ORS=(RT ? "}\n{" : "\n")} 1'
{foo}
{bar}
The above uses GNU awk for multi-char RS and RT and will work on any size input file as it does not read the whole file into memory at one time, just each } {-separated "line" one at a time.
You may read input in blocks/chunks and process them one by one.
use strict;
use warnings;
binmode(STDIN);
binmode(STDOUT);
my $CHUNK=0x2000; # 8kiB
my $buffer = '';
while( sysread(STDIN, $buffer, $CHUNK, length($buffer))) {
    $buffer =~ s/\}\s\{/}\n{/sg;
    if( length($buffer) > $CHUNK) {           # More than one chunk buffered
        syswrite( STDOUT, $buffer, $CHUNK);   # write FIRST of buffered chunks
        substr($buffer,0,$CHUNK,'');          # remove FIRST of buffered chunks from buffer
    }
}
syswrite( STDOUT, $buffer) if length($buffer);

How to delete text in text file in Windows using Perl?

I want to port my Perl application to Windows.
Currently it calls out to "grep" to delete text in a text file, like so:
system("grep -v '$mcadd' $ARGV[0] >> $ARGV[0].bak");
system("mv $ARGV[0].bak $ARGV[0]");
This works perfectly well on Ubuntu, but I'm not sure (a) how to modify my Perl script to achieve the same effect on Windows, and (b) whether there is a way to achieve the effect in a way that will work in both environments.
Is there another way to delete text in Perl?
You can use Perl's in-place editing facility.
~/pperl_programs$ cat data.txt
hello world
goodbye mars
goodbye perl6
back to perl5
Run this:
use strict;
use warnings;
use 5.020;
my $fname = 'data.txt';
#Always use three arg form of open().
#Don't use bareword filehandles.
#open my $INFILE, '<', $fname
# or die "Couldn't open $fname: $!";
{
    local $^I = ".bak";       # Turn on in-place editing for this block only
    local @ARGV = $fname;     # Set @ARGV for this block only
    while (my $line = <>) {   # The "diamond operator" reads from @ARGV
        if ($line !~ /hello/) {
            print $line;      # This does not go to STDOUT--it goes to a new file that perl creates for you.
        }
    }
} # Restore $^I and @ARGV to their previous values
#close $INFILE;
Here is the result:
$ cat data.txt
goodbye mars
goodbye perl6
back to perl5
With in-place editing turned on, perl takes care of creating a new file, sending print() output to that new file, renaming the new file to the original file name when you are done, and saving a copy of the original file with a .bak extension.
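The same facility is available directly from the command line; a rough one-liner equivalent of the block above (assuming the file is data.txt, as in the example) would be:
perl -i.bak -ne 'print unless /hello/' data.txt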
system("perl -n -e 'if(\$_ !~ /$mcadd/) { print \$_; }' \$ARGV[0] >> \$ARGV[0].bak");
system("rename \$ARGV[0].bak \$ARGV[0]");
This should work in windows.

Extract stream of bytes between two strings using sed in bash

I have a single binary file with multiple images in it. Every image area starts with an ASCII string. The problem is that the bytes between the strings are not text/ASCII but plain binary bytes.
So the binary data between "string1" and "string2" is my image-1 and so on.
How do I extract each image in bash, maybe using sed?
Please help.
Here's a Perl version. Put it in a file "myprog", edit the header to what you want, chmod +x myprog, and do ./myprog < yourdatafile. It creates files out1, out2, etc. It assumes the file starts with the header; otherwise you can ignore the first part.
#!/usr/bin/perl
use strict;
use warnings;

# Slurp the whole (binary) file from STDIN.
binmode STDIN;
my $data = join('', <STDIN>);

my $i = 1;
my $header = "abc";    # put the real marker string here
foreach my $part (split(/\Q$header\E/, $data)) {
    open(my $out, '>', "out$i") or die "Can't write out$i: $!";
    binmode $out;
    print $out $header, $part;
    close($out);
    $i++;
}

append a text on the top of the file

I want to add text at the top of my data.txt file, but this code adds the text at the end of the file. How can I modify this code to write the text at the top of my data.txt file? Thanks in advance for any assistance.
open (MYFILE, '>>data.txt');
print MYFILE "Title\n";
close (MYFILE)
perl -pi -e 'print "Title\n" if $. == 1' data.text
Your open syntax is deprecated (thanks, Seth); use the three-argument form and check for errors:
open(MYFILE, '>>', "data.txt") or die $!;
You will have to make a full pass through the file and write out the desired data before the existing file contents:
open my $in, '<', $file or die "Can't read old file: $!";
open my $out, '>', "$file.new" or die "Can't write new file: $!";
print $out "# Add this line to the top\n"; # <--- HERE'S THE MAGIC
while( <$in> ) {
print $out $_;
}
close $out;
close $in;
unlink($file);
rename("$file.new", $file);
(gratuitously stolen from the Perl FAQ, then modified)
This will process the file line-by-line so that on large files you don't chew up a ton of memory. But, it's not exactly fast.
Hope that helps.
There is a much simpler one-liner to prepend a block of text to every file. Let's say you have a set of files named body1, body2, body3, etc, to which you want to prepend a block of text contained in a file called header:
cat header | perl -0 -i -pe 'BEGIN {$h = <STDIN>}; print $h' body*
Appending to the top is normally called prepending.
open(M,"<","data.txt");
#m = <M>;
close(M);
open(M,">","data.txt");
print M "foo\n";
print M #m;
close(M);
Alternatively, open data.txt- for writing and then rename data.txt- to data.txt after the close, which has the benefit of being atomic, so interruptions cannot leave the data.txt file truncated.
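A minimal sketch of that variant, reusing the file names from the snippet above (rename() is atomic as long as both names are on the same filesystem):
open(my $in, "<", "data.txt") or die $!;
my @m = <$in>;
close($in);

open(my $out, ">", "data.txt-") or die $!;   # write the new contents to a temporary file
print $out "foo\n";
print $out @m;
close($out) or die $!;

rename("data.txt-", "data.txt") or die $!;   # atomically replace the original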
See the Perl FAQ Entry on this topic
perl -ni -e 'print "Title\n" if $. == 1; print' filename
This prints the title only once, at the top of the file.

Storing files inside BASH scripts

Is there a way to store binary data inside a BASH script so that it can be piped to a program later in that script?
At the moment (on Mac OS X) I'm doing
play sound.m4a
# do stuff
I'd like to be able to do something like:
SOUND <<< the m4a data, encoded somehow?
END
echo $SOUND | play
#do stuff
Is there a way to do this?
Base64 encode it. For example:
$ openssl base64 < sound.m4a
and then in the script:
S=$(cat <<SOUND
YOURBASE64GOESHERE
SOUND
)
echo "$S" | openssl base64 -d | play
I know this is like beating a dead horse since this post is rather old, but I'd like to improve on Sionide21's answer, as his solution stores the encoded data in a variable, which is not necessary.
openssl base64 -d <<SOUND | play
YOURBASE64DATAHERE
SOUND
Note: heredoc syntax requires that you don't indent the final 'SOUND', and base64 decoding sometimes failed on me when I indented the 'YOURBASE64DATAHERE' section. So it's best practice to keep the base64 data as well as the end token unindented.
I found this while looking for a more elegant way to store binary data in shell scripts, but I had already solved it as described here. The only difference is that I'm transporting some tar-bzipped files this way. My platform has a separate base64 binary, so I don't have to use openssl.
base64 -d <<EOF | tar xj
BASE64ENCODEDTBZ
EOF
There is a Unix format called shar (shell archive) that allows you to store binary data in a shell script. You create a shar file using the shar command.
When I've done this I've used a shell here document piped through atob.
function emit_binary {
    cat << 'EOF' | atob
--junk emitted by btoa here
EOF
}
The single quotes around 'EOF' prevent parameter expansion in the body of the here document.
atob and btoa are very old programs, and for some reason they are often absent from modern Unix distributions. A somewhat less efficient but more ubiquitous alternative is to use mimencode -b instead of btoa. mimencode will encode into base64 ASCII. The corresponding decoding command is mimencode -b -u instead of atob. The openssl command will also do base64 encoding.
Here's some code I wrote a long time ago that packs an executable of your choice into a bash script. I can't remember exactly how it works, but I suspect you could pretty easily modify it to do what you want.
#!/usr/bin/perl
use strict;
print "Stub Creator 1.0\n";
unless($#ARGV == 1)
{
    print "Invalid argument count, usage: ./makestub.pl InputExecutable OutputCompressedExecutable\n";
    exit;
}
unless(-r $ARGV[0])
{
    die "Unable to read input file $ARGV[0]: $!\n";
}
open(my $OUTFILE, '>', $ARGV[1]) or die "Unable to create $ARGV[1]: $!\n";
print "\nCreating stub script...";
# The two-line stub extracts the gzipped payload appended after it, runs it, then cleans up.
print $OUTFILE "#!/bin/bash\n";
print $OUTFILE "a=/tmp/\`date +%s%N\`;tail -n+3 \$0 | zcat > \$a;chmod 700 \$a;\$a \${*};rm -f \$a;exit;\n";
close($OUTFILE);
print "done.\nCompressing input executable and appending...";
`gzip $ARGV[0] -n --best -c >> $ARGV[1]`;
`chmod +x $ARGV[1]`;
my $OrigSize = -s $ARGV[0];
my $NewSize  = -s $ARGV[1];
my $Temp;
if($OrigSize == 0)
{
    $OrigSize = 1;    # avoid division by zero for an empty input file
}
$Temp = ($NewSize / $OrigSize) * 100;
$Temp *= 1000;
$Temp = int($Temp);
$Temp /= 1000;
print "done.\nStub successfully composed!\n\n";
print <<THEEND;
Original size: $OrigSize
New size: $NewSize
Compression: $Temp\%
THEEND
If it's a single block of data to use, the trick I've used is to put a "start of data" marker at the end of the file, then use sed in the script to filter out the leading stuff. For example, create the following as "play-sound.bash":
#!/bin/bash
sed '1,/^START OF DATA/d' "$0" | play
exit 0
START OF DATA
Then, you can just append your data to the end of this file:
cat sound.m4a >> play-sound.bash
and now, executing the script should play the sound directly.
Since Python is available on OS X by default, you can do the following:
ENCODED=$(python -m base64 foo.m4a)
Then decode it as below:
echo "$ENCODED" | python -m base64 -d | play
