Splitting very long (4GB) string with new lines - bash

I have a file that is supposed to be JSON objects, one per line. Unfortunately, a miscommunication happened with the creation of the file, and the JSON objects only have a space between them, not a new-line.
I need to fix this by replacing every instance of } { with }\n{.
Should be easy for sed or Perl, right?
sed -e "s/}\s{/}\n{/g" file.in > file.out
perl -pe "s/}\s{/}\n{/g" file.in > file.out
But file.in is actually 4.4 GB which seems to be causing a problem for both of these solutions.
The sed command finishes with a halfway-correct file, but file.out is only 335 MB and contains only about the first tenth of the input file, cutting off in the middle of a line. It's almost as if sed just quit in the middle of the stream. Maybe it's trying to load the entire 4.4 GB file into memory, running out of stack space at around 300 MB, and silently killing itself.
The Perl command errors with the following message:
[1] 2904 segmentation fault perl -pe "s/}\s{/}\n{/g" file.in > file.out
What else should I try?

Unlike the earlier solutions, this one handles {"x":"} {"}.
use strict;
use warnings;
use feature qw( say );
use JSON::XS qw( );
use constant READ_SIZE => 64*1024*1024;
my $j_in = JSON::XS->new->utf8;
my $j_out = JSON::XS->new;
binmode STDIN;
binmode STDOUT, ':encoding(UTF-8)';
while (1) {
my $rv = sysread(\*STDIN, my $block, READ_SIZE);
die($!) if !defined($rv);
last if !$rv;
$j_in->incr_parse($block);
while (my $o = $j_in->incr_parse()) {
say $j_out->encode($o);
}
}
die("Bad data") if $j_in->incr_text !~ /^\s*\z/;
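To see why the string case matters, here is a small sketch (not part of the answer above) that applies the plain text substitution to a two-object sample in which a string value happens to contain } {. The function name naive_split and the sample data are made up for illustration; the sed call uses a backslash-newline in the replacement so that it does not rely on GNU extensions.

```shell
# Hypothetical helper: replace every "} {" with "}" + newline + "{",
# as the text-based one-liners do, with no awareness of JSON strings.
naive_split() {
    sed 's/} {/}\
{/g'
}

# Two JSON objects; the first one contains "} {" inside a string value.
printf '%s\n' '{"x":"} {"} {"y":1}' | naive_split
```

The sample holds two objects but comes out as three lines, and the first two are no longer valid JSON. A parser-based approach such as the incremental one above does not have this problem.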

The default input record separator in Perl is \n, but you can change it to any character you want. For this problem, you could use { (octal 173).
perl -0173 -pe 's/}\s{/}\n{/g' file.in > file.out

perl -ple 'BEGIN{$/=qq/} {/;$\=qq/}\n{/}undef$\ if eof' <input >output

Assuming your input doesn't contain } { pairs in other contexts that you do not want replaced, all you need is:
awk -v RS='} {' '{ORS=(RT ? "}\n{" : "\n")} 1'
e.g.
$ printf '{foo} {bar}' | awk -v RS='} {' '{ORS=(RT ? "}\n{" : "\n")} 1'
{foo}
{bar}
The above uses GNU awk for multi-char RS and RT and will work on any size input file as it does not read the whole file into memory at one time, just each } {-separated "line" one at a time.
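If GNU awk is not available, a rough equivalent can be sketched with a single-character RS, which any POSIX awk supports. This is a different technique from the RS/RT one above: it splits records on { and rewrites a trailing "} " at the end of each record. The function name split_objects is made up, and the sketch assumes the separator is exactly one space; like the other text-based answers, it will also split on a } { that sits inside a JSON string.

```shell
# Hypothetical helper: stream stdin to stdout, turning "} {" into "}\n{".
# Assumes the separator between objects is exactly one space.
split_objects() {
    awk '
        BEGIN { RS = "{"; ORS = "" }   # one record per "{"; print appends nothing
        NR > 1 { print "{" }           # put back the "{" that RS consumed
        {
            rec = $0
            # A record ending in "} " was followed by "{" (or by end of input):
            # swap that trailing space for a newline.
            if (rec ~ /[}] $/)
                rec = substr(rec, 1, length(rec) - 1) "\n"
            print rec
        }
    '
}
```

Usage would be along the lines of split_objects < file.in > file.out. Memory use is bounded by the longest stretch of input between two { characters rather than by the file size.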

You may read input in blocks/chunks and process them one by one.
use strict;
use warnings;
binmode(STDIN);
binmode(STDOUT);
my $CHUNK=0x2000; # 8kiB
my $buffer = '';
while( sysread(STDIN, $buffer, $CHUNK, length($buffer))) {
$buffer =~ s/\}\s\{/}\n{/sg;
if( length($buffer) > $CHUNK) { # More than one chunk buffered
syswrite( STDOUT, $buffer, $CHUNK); # write FIRST of buffered chunks
substr($buffer,0,$CHUNK,''); # remove FIRST of buffered chunks from buffer
}
}
syswrite( STDOUT, $buffer) if length($buffer);

Related

split a large file by a delimiter- out of memory

I want to split a large file to many files based on a delimiter. The delimiter I am aiming in my input file is // (double forward slash in a newline). Part of my file is like
..
...
7141 gatttaggca gtgaaaactt agtagccgac aaggtgaaag atgccgagaa tgtactaagg
7201 gtaaaggcag ctaaaacaga ctttaccgat agcaccaacc tatcggtcat cactcaagac
7261 ggaggctttt atagctttga ggtgagttat cacaccacgc cacaacctct taccattgat
7321 tttggtagag gaatgcccca aggcaataat gtgaaatcgg atattctctt ttctgacaca
7381 ggctgggaat cacctgcggt agcacagatt attatgtcgt ctatct
//
LOCUS KE150251 6962 bp DNA linear CON
14-JUN-2013
DEFINITION Capnocytophaga granulosa ATCC 51502 genomic scaffold
acFDk-supercont1.18/ whole genome shotgun sequence.
...
..
I also want to include these slashes as the last line of the generated files.
I failed to do it with csplit on my Mac, and ended up with the following awk script:
awk -v RS='^//' '{ outfile = "output_file_" NR; print > outfile}' Input.gbk
But I am getting following error:
awk(56213,0x7fffb585b3c0) malloc: ***
mach_vm_map(size=18446744071562067968) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
awk: out of memory in readrec 1
source line number 1
Thanks for your suggestions!
Better to use a library to parse large GenBank files. Here's one way using Bio::SeqIO::genbank, which returns Bio::Seq objects and writes them to separate files by display id. Put the following into a file called split_genbank.pl:
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::SeqIO::genbank;
my $stream = Bio::SeqIO->new(-file => $ARGV[0], -format => 'GenBank');
while ( my $seq = $stream->next_seq ) {
my $id = $seq->display_id();
my $out = Bio::SeqIO->new(-format => 'GenBank', -file => ">$id.gbk");
$out->write_seq($seq);
}
Then call it using:
perl split_genbank.pl input.gbk
I believe that since you have NOT closed the new output files, they are eating up memory. Could you please try the following:
awk -v RS='^//' '{close(outfile)} {outfile = "output_file_" NR; print > outfile}' Input.gbk
EDIT: One more try with another approach. I believe your file has many lines between the // delimiters, so memory fills up when RS is used to read each record whole; it is better to use a flag approach than the RS approach.
awk -v outfile="output_file_1" -v count=1 '/^\/\//{print > outfile; close(outfile);outfile = "output_file_" ++count;next} {print > (outfile)}' Input.gbk
Explanation of the above approach: it checks for a line that starts with //, prints it to the current output file, closes that file, and then increments the counter to set the name of the next output file. Closing the output file matters here; otherwise you may also get a "too many open files" error.
By setting RS, you make awk read in data until the separator. You say your file is large, so it may be that the resulting records are bigger than the memory available to awk for processing.
For your application, you could use the default value for RS and compute the effective NR by hand by incrementing a counter whenever the delimiter is read:
awk '
BEGIN {
pre = "output_file_"
n = 1
outfile = pre n
}
{
print > outfile
}
/^\/\// {
close(outfile)
n++
outfile = pre n
}
' Input.gbk
Since you have access to GNU csplit, you can use it:
csplit Input.gbk '/^\/\//+1' '{*}'
Your original command doesn't work because csplit expects a regular expression, not a fixed string.

bash: transform scaffold fasta

I have a fasta file with the following sequences:
>NZ_OCNF01123018.1
TACAAATACAACAAATACAAGTACACCAAGTACAAATACAAGTATCCCAAGTACAAATACAAGTA
TCCCAAGTACAAATACAAGTATTCCAAGTACAAATACAAAACCTGTTGAGCAACCTAAACCTGTTGAAC
AGCCCAAACCTGTTGAACAGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAACCTTTATCCGCACTTA
CGAGCAAATACACCAATACCGCTTTATCGGCACAGTCTGCCCAAATTGACGGATGCACCATGTTACCCAACAC
ATCAATCAACGTTTGTGGGATCACCTGAAAAAGGGCGCGGTTTGTGGTTGATG
>NZ_OCNF01123018.2
AATTGTCGTGTAAAGCCACACCAAACCCCATTATAGCCCCAAAAACACCAAAAAGGCTGCCTGAACCACATTTCAGACAG
I want to split every sequence in the file that contains a run of multiple N at the site where the run occurs, making two sequences out of it.
Expected solution:
>NZ_OCNF01123018.1
TACAAATACAACAAATACAAGTACACCAAGTACAAATACAAGTATCCCAAGTACAAATACAAGTA
TCCCAAGTACAAATACAAGTATTCCAAGTACAAATACAAAACCTGTTGAGCAACCTAAACCTGTTGAAC
AGCCCAAACCTGTTGAACAGC
>contig1
AAACCTTTATCCGCACTTA
CGAGCAAATACACCAATACCGCTTTATCGGCACAGTCTGCCCAAATTGACGGATGCACCATGTTACCCAACAC
ATCAATCAACGTTTGTGGGATCACCTGAAAAAGGGCGCGGTTTGTGGTTGATG
>NZ_OCNF01123018.2
AATTGTCGTGTAAAGCCACACCAAACCCCATTATAGCCCCAAAAACACCAAAAAGGCTGCCTGAACCACATTTCAGACAG
my (inelegant) approach would be this:
perl -pe 's/[N]+/\*/g' $file | perl -pe 's/\*/\n>contig1\n/g'
of course that also replaces the N of the sequence header and creates headers without a sequence. As a plus, it would be nice to number the new 'contigs' from 1 to x in case there are multiple sequences with N.
What do you suggest?
I'd suggest to use split instead of trying to get a regex just right, and in a script instead of a brittle and crammed "one"-liner.
use warnings;
use strict;
use feature 'say';
my $file = shift @ARGV;
die "Usage: $0 filename\n" if !$file; # also check submitted $file
my $content = do { # or: my $content = Path::Tiny::path($file)->slurp;
    local $/;
    open my $fh, '<', $file or die "Can't open $file: $!";
    <$fh>;
};
my @f = grep { /\S/ } split /(?<!>)NN+/, $content;
say shift @f;
my $cnt;
for (@f) {
    say "\n>contig", (++$cnt), ":\n$_";
}
This slurps the file into $content since NN+ can span multiple lines; Path::Tiny module can make that cleaner. The first element of the obtained array needs no >contig so it is shifted away.
The negative lookbehind (?<!...) makes the regex in split's separator pattern match NN+ only when not preceded by >, thus protecting (excluding) header lines that may start with that. If headers may contain consecutive N which are not right after > then you need to refine this.
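If headers may contain runs of N anywhere, one way to sidestep the lookbehind is to process the file record by record and only ever search the sequence text. The following is a sketch of that alternative in portable awk, not the approach of the answer above; the function name split_at_n_runs is made up. It joins the wrapped sequence lines of each record, so the original line wrapping is not preserved, and it numbers the new records contig1, contig2, and so on across the whole file.

```shell
# Hypothetical helper: read FASTA on stdin, split each sequence at runs of
# two or more N, and number the new records contig1, contig2, ...
# Header lines are never searched for N. Sequence line wrapping is not kept.
split_at_n_runs() {
    awk '
        function emit(    n, parts, i, printed) {
            if (header == "") return
            print header
            n = split(seq, parts, /NN+/)      # pieces between the N runs
            printed = 0
            for (i = 1; i <= n; i++) {
                if (parts[i] == "") continue  # run at the very start or end
                if (printed) print ">contig" (++count)
                print parts[i]
                printed = 1
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # a new record starts
             { seq = seq $0 }                          # join wrapped lines
        END  { emit() }
    '
}
```

Usage would be something like split_at_n_runs < file.fasta > split.fasta. Only one record is held in memory at a time, which may matter for large scaffolds files.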
I expanded your perl one-liner a bit:
cat file.fasta | \
perl -pe 's/\n//g unless /^>/; s/>/\n>/g;' | \
perl -pe 's/N+(?{$n++})/\n>contig${n}\n/g unless /^>/'
the first part is to remove newlines between bases, the second part is to replace continuous 'N'.

Splitting large text file on every blank line

I'm having a bit of trouble splitting a large text file into multiple smaller ones. The syntax of my text file is the following:
dasdas #42319 blaablaa 50 50
content content
more content
content conclusion
asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion
asdasd #299 yadayada 60 40
content
content
contend done
...and so on
A typical information table in my file has anywhere between 10-40 rows.
I would like this file to be split in n smaller files, where n is the amount of content tables.
That is
dasdas #42319 blaablaa 50 50
content content
more content
content conclusion
would be its own separate file, (whateverN.txt)
and
asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion
again a separate file whateverN+1.txt and so forth.
It seems like awk or Perl are nifty tools for this, but having never used them before the syntax is kinda baffling.
I found these two questions that are almost correspondent to my problem, but failed to modify the syntax to fit my needs:
Split text file into multiple files & How can I split a text file into multiple text files? (on Unix & Linux)
How should one modify the command line inputs, so that it solves my problem?
Setting RS to null tells awk to use one or more blank lines as the record separator. Then you can simply use NR to set the name of the file corresponding to each new record:
awk -v RS= '{print > ("whatever-" NR ".txt")}' file.txt
RS:
This is awk's input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines, or a regexp, in which case records are separated by matches of the regexp in the input text.
$ cat file.txt
dasdas #42319 blaablaa 50 50
content content
more content
content conclusion
asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion
asdasd #299 yadayada 60 40
content
content
contend done
$ awk -v RS= '{print > ("whatever-" NR ".txt")}' file.txt
$ ls whatever-*.txt
whatever-1.txt whatever-2.txt whatever-3.txt
$ cat whatever-1.txt
dasdas #42319 blaablaa 50 50
content content
more content
content conclusion
$ cat whatever-2.txt
asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion
$ cat whatever-3.txt
asdasd #299 yadayada 60 40
content
content
contend done
$
You could use the csplit command:
csplit \
--quiet \
--prefix=whatever \
--suffix-format=%02d.txt \
--suppress-matched \
infile.txt /^$/ {*}
POSIX csplit only uses short options and doesn't know --suffix and --suppress-matched, so this requires GNU csplit.
This is what the options do:
--quiet – suppress output of file sizes
--prefix=whatever – use whatever instead of the default xx filename prefix
--suffix-format=%02d.txt – append .txt to the default two digit suffix
--suppress-matched – don't include the lines matching the pattern on which the input is split
/^$/ {*} – split on pattern "empty line" (/^$/) as often as possible ({*})
Perl has a useful feature called the input record separator. $/.
This is the 'marker' for separating records when reading a file.
So:
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = "\n\n";
my $count = 0;
while ( my $chunk = <> ) {
open ( my $output, '>', "filename_".$count++ ) or die $!;
print {$output} $chunk;
close ( $output );
}
Just like that. The <> is the 'magic' filehandle, in that it reads piped data or from files specified on command line (opens them and reads them). This is similar to how sed or grep work.
This can be reduced to a one liner:
perl -00 -pe 'open ( $out, ">", "filename_".++$n ); select $out;' yourfilename_here
You can use this awk,
awk 'BEGIN{file="content"++i".txt"} !NF{file="content"++i".txt";next} {print > file}' yourfile
(OR)
awk 'BEGIN{i++} !NF{++i;next} {print > "filename"i".txt"}' yourfile
More readable format:
BEGIN {
file="content"++i".txt"
}
!NF {
file="content"++i".txt";
next
}
{
print > file
}
In case you get "too many open files" error as follows...
awk: whatever-18.txt makes too many open files
input record number 18, file file.txt
source line number 1
You may need to close newly created file, before creating a new one, as follows.
awk -v RS= '{close("whatever-" i ".txt"); i++}{print > ("whatever-" i ".txt")}' file.txt
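The same idea can be sketched without paragraph mode, reading line by line and closing each output file as soon as its block ends, so at most one output file is open at any time. This is a variation on the answers above rather than anyone's posted solution; the function name split_on_blank_lines and its two arguments are made up for illustration. Runs of several blank lines are treated as a single separator.

```shell
# Hypothetical helper. Usage: split_on_blank_lines INPUT PREFIX
# Writes PREFIX1.txt, PREFIX2.txt, ... with one blank-line-separated block each.
split_on_blank_lines() {
    awk -v prefix="$2" '
        !NF {                                  # blank (or whitespace-only) line
            if (is_open) { close(file); is_open = 0 }
            next
        }
        !is_open { file = prefix (++n) ".txt"; is_open = 1 }
        { print > file }
    ' "$1"
}
```

For example, split_on_blank_lines file.txt whatever- would produce whatever-1.txt, whatever-2.txt, and so on.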
Since it's Friday and I'm feeling a bit helpful... :)
Try this. If the file is as small as you imply it's simplest to just read it all at once and work in memory.
use strict;
use warnings;
# slurp file
local $/ = undef;
open my $fh, '<', 'test.txt' or die $!;
my $text = <$fh>;
close $fh;
# split on double new line
my @chunks = split(/\n\n/, $text);
# make new files from chunks
my $count = 1;
for my $chunk (@chunks) {
    open my $ofh, '>', "whatever$count.txt" or die $!;
    print $ofh $chunk, "\n";
    close $ofh;
    $count++;
}
The perl docs can explain any individual commands you don't understand but at this point you should probably look into a tutorial as well.
awk -v RS="\n\n" '{for (i=1;i<=NR;i++); print > i-1}' file.txt
Sets record separator as blank line, prints each record as a separate file numbered 1, 2, 3, etc. Last file (only) ends in blank line.
Try this bash script also
#!/bin/bash
i=1
fileName="OutputFile_$i"
# IFS= and -r keep leading whitespace and backslashes intact
while IFS= read -r line ; do
    if [ -z "$line" ] ; then
        ((++i))
        fileName="OutputFile_$i"
    else
        printf '%s\n' "$line" >> "$fileName"
    fi
done < InputFile.txt
You can also try split -p "^$"

Taking multiple header (rows matching condition) and convert into a column

Hello I have a file that has multiple Headers in it that I need to have turned into column values. The file looks like this:
Day1
1,Smith,London
2,Bruce,Seattle
5,Will,Dallas
Day2
1,Mike,Frisco
4,James,LA
I would like the file to end up looking like this:
Day1,1,Smith,London
Day1,2,Bruce,Seattle
Day1,5,Will,Dallas
Day2,1,Mike,Frisco
Day2,4,James,LA
The file doesn't have sequential numbers before the names and it doesn't have the same quantity of records after the "Day" Header.
Does anyone have any ideas on how to accomplish this using the command-line?
In awk
awk -F, 'NF==1{a=$0;next}{print a","$0}' file
This checks whether the number of fields is 1; if it is, it saves the line in a variable and skips the second block.
For each line that doesn't have exactly 1 field, it prints the saved variable, a comma, and the line.
And in sed
sed -n '/,/!{h};/,/{x;G;s/\n/,/;p;s/,.*//;x}' file
Broken down for MrBones wild ride.
sed -n '
/,/!{h}; # If the line does not contain a comma, overwrite the hold buffer with the line
/,/{ # If the line contains a comma, do everything inside the braces
x; # Exchange the line with the contents of the hold buffer
G; # Append the hold buffer to the line
s/\n/,/; # Replace the newline with a comma
p; # Print the line
s/,.*//; # Remove everything from the first comma onwards
x # Exchange again to put the header back in the hold buffer for the next line
}' file # The file you are using
In essence, it saves the lines without a comma, i.e. the headers. Then, if the line is not a header, it swaps the current line with the saved header and appends the swapped-out line to the end of the header. Since G appends it after a newline, the next statement replaces that newline with a comma, and the line is printed. Next, to recover the header, everything after it is removed and it is swapped back into the hold buffer, ready for the next line.
sed '/^Day/ {h;d;}
G;s/\(.*\)\n\(.*\)/\2,\1/
' YourFile
POSIX compliant.
Prints nothing for a Day header that has no data lines after it.
Blank lines are treated as data.
awk '{if ( $0 ~ /^Day/ ) Head = $0; else print Head "," $0}' YourFile
This uses each Day line as a paragraph separator and its content as the header for the lines that follow.
Perl solution:
#! /usr/bin/perl
use warnings;
use strict;
my $header;
while (<>) { # Read line by line.
if (/,/) { # If the line contains a comma,
print "$header,$_"; # prepend the header.
} else {
chomp; # Remove the newline.
$header = $_; # Remember the header.
}
}
Another sed version
sed -n '/Day[0-9]\+/{h;b end};{G;s/\(.*\)\n\(.*\)/\2,\1/;p;:end}'
Perl
$ perl -F, -wlane ' if(@F == 1){$s=$F[0]; next}print "$s,$_"' file
Day1,1,Smith,London
Day1,2,Bruce,Seattle
Day1,5,Will,Dallas
Day2,1,Mike,Frisco
Day2,4,James,LA
This Perl one-line program will do as you ask. It requires Perl v5.14 or later (for the /r modifier).
perl -ne'tr/,// ? print $c,$_ : ($c = s/\s*\z/,/r)' myfile.txt
for earlier versions of perl, use
perl -ne'tr/,// ? print $c,$_ : ($c = $_) =~ s/\s*\z/,/' myfile.txt
output
Day1,1,Smith,London
Day1,2,Bruce,Seattle
Day1,5,Will,Dallas
Day2,1,Mike,Frisco
Day2,4,James,LA
Another perl example- this time using $/ to separate each record.
use strict;
use warnings;
local $/ = "Day";
while (<>) {
next unless my ($num) = m/^(\d+)/;
for ( split /\n/ ) {
print "Day${num},$_\n" if m/,/;
}
}

Why does the output order of a Perl eval differ between the terminal and redirecting STDOUT + STDERR into a file?

My code is:
perl -e'
use strict; use warnings;
my $a={};
eval{ test(); };
sub test{
print "11\n";
if( $a->{aa} eq "aa"){
print "aa\n";
}
else{
print "bb\n";
}
}'
Output on Terminal is:
11
Use of uninitialized value in string eq at -e line 9.
bb
If I redirect into a file, the output order differs. Why?
perl -e'
...
' > t.log 2>&1
cat t.log:
Use of uninitialized value in string eq at -e line 9.
11
bb
My perl Version:
This is perl 5, version 18, subversion 4 (v5.18.4) built for x86_64-linux-thread-multi
(with 20 registered patches, see perl -V for more detail)
A simpler demonstration of the problem:
$ perl -e'print("abc\n"); warn("def\n");'
abc
def
$ perl -e'print("abc\n"); warn("def\n");' 2>&1 | cat
def
abc
This is due to differences in how STDOUT and STDERR are buffered.
STDERR isn't buffered.
STDOUT flushes its buffer when a newline is encountered if STDOUT is connected to a terminal.
STDOUT flushes its buffer when it's full otherwise.
$| = 1; turns off buffering for STDOUT[1].
$ perl -e'$| = 1; print("abc\n"); warn("def\n");' 2>&1 | cat
abc
def
[1] Strictly speaking, $| applies to the currently selected handle (the one print writes to when no handle is specified), which is STDOUT by default.
It's purely an autoflush problem, not an eval issue.
Solution is:
perl -e'
use strict;
use warnings;
$|++; # <== this autoflush print output
my $a={};
test();
sub test{
print "11\n";
if( $a->{aa} eq "aa"){
print "aa\n";
}
else{
print "bb\n";
}
}' > t.log 2>&1
In some cases the same problem also occurs on the terminal:
perl -e'print("abc"); print(STDERR "def\n"); print("ghi\n");'
The only safe way to get the correct order is to turn on autoflush!
@dgw + ikegami ==> thanks

Resources