Parsing a CSV file and skipping the first 3000 lines - algorithm

I wrote this function to modify my CSV file:
sub convert
{
    # open the input/output files
    my $file = $firstname."_lastname_".$age.".csv";
    $file =~ /(.+\/)(.+\.csv)/;
    my $file_simple = $2;
    open my $in, '<', $file or die "can not read the file: $file $!";
    open my $out, '>', $outPut."_lastname.csv" or die "can not open the output file: $!";
    $_ = <$in>;    # read the first (header) line
    print $out "X,Y,Z,W\n";
    while( <$in> )
    {
        if(/(-?\d+),(-?\d+),(-?\d+),(-?\d+),(-?\d+)/)
        {
            my $tmp = ($4.$5);    # note: unused
            print $out $2.$sep.$3.$sep.$4.$sep.($5/10)."\n";
        }
        else
        {
            print $out "Error: ".$_;
        }
    }
    close $out;
}
I would like to skip the first 3000 lines, but I have no idea how to do it; it's my first time using Perl.
Thank you.

Since you wish to skip the first 3000 lines, just use next if in tandem with the current line number variable $.:
use strict;
use warnings;

my $skip_lines = 3001;
open(my $fh, '<', 'data.dat') or die $!;
while (<$fh>) {
    next if $. < $skip_lines;
    # process the line
}
close($fh);
Since $. holds the current line number, this program simply tells Perl to start processing at the 3001st line, effectively skipping the first 3000 lines. As desired.
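Applied to the convert routine from the question, a minimal sketch (untested, reusing the question's variables) might look like this; note that the header line read before the loop counts as line 1, so the 3000 lines to skip are lines 2 through 3001:
$_ = <$in>;                  # header line; $. is now 1
print $out "X,Y,Z,W\n";
while ( <$in> )
{
    next if $. <= 3001;      # skip the first 3000 data lines
    if (/(-?\d+),(-?\d+),(-?\d+),(-?\d+),(-?\d+)/)
    {
        print $out $2.$sep.$3.$sep.$4.$sep.($5/10)."\n";
    }
    else
    {
        print $out "Error: ".$_;
    }
}
For reference, here is what perlvar says about $.: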
$. Current line number for the last filehandle accessed. Each filehandle in Perl counts the number of lines that have been read from it. (Depending on the value of $/, Perl's idea of what constitutes a line may not match yours.) When a line is read from a filehandle (via readline() or <>), or when tell() or seek() is called on it, $. becomes an alias to the line counter for that filehandle. You can adjust the counter by assigning to $., but this will not actually move the seek pointer. Localizing $. will not localize the filehandle's line count. Instead, it will localize perl's notion of which filehandle $. is currently aliased to. $. is reset when the filehandle is closed, but not when an open filehandle is reopened without an intervening close(). For more details, see I/O Operators in perlop. Because <> never does an explicit close, line numbers increase across ARGV files (but see examples in eof). You can also use HANDLE->input_line_number(EXPR) to access the line counter for a given filehandle without having to worry about which handle you last accessed. Mnemonic: many programs use "." to mean the current line number.
REFERENCE:
http://perldoc.perl.org/perlvar.html

Related

Iterative storing items to corresponding files by a given instruction

I want to sort items into their corresponding files according to a given instruction.
The instruction is in instruction.txt:
item_1 file_5
item_3 file_2
item_6 file_7
item_22 file_2
...
item_m file_n
Items are stored in a contents.txt:
>item_1
blablas
bla
>item_2
blas
...
>item_m
bla
bla
bla
I want the procedure to read the instruction for each item, go to the contents file, extract that item with its contents (including the >item_* line, excluding the next >), append it to the corresponding file_**, and save it as file_**_upd.
Would be grateful for the assistance!
P.S. Some items belong to the same files!
Perl to the rescue!
perl -we '
    open my $instruction, "<", "instruction.txt" or die $!;
    my %where = map split, <$instruction>;
    open my $contents, "<", "contents.txt" or die $!;
    my $out;
    while (<$contents>) {
        open $out, ">", $where{$1} if /^>(.*)/;
        print {$out} $_;
    }'
open opens a file, "<" means for reading, while ">" means for writing.
The diamond operator <> reads from the file handle, see readline.
The pairs from instructions are saved to an associative table %where (see also map and split).
contents.txt is read line by line; whenever a line starts with >, a new output file is opened. The output filehandle is declared outside the loop so that it survives across iterations, which is why all the lines after a given > header are printed to the same file.
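For the sample instruction.txt above, %where would end up holding something like this (illustrative):
my %where = (
    item_1  => 'file_5',
    item_3  => 'file_2',
    item_6  => 'file_7',
    item_22 => 'file_2',
    # ... item_m => 'file_n'
);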
Update: To handle multiple items per file, and also to output items with no file assigned, you need to do a bit more work:
perl -we '
    open my $instruction, "<", "instruction.txt" or die $!;
    my %where = map split, <$instruction>;
    open my $contents, "<", "contents.txt" or die $!;
    my $out;
    my $unknown = "file_unknown";
    my %created;
    while (<$contents>) {
        open $out, $created{ $where{$1} // $unknown }++ ? ">>" : ">",
                   $where{$1} // $unknown
            if /^>(.*)/;
        print {$out} $_;
    }'
The hash %created keeps track of files that have already been created, so that the next time around they are appended to rather than overwritten. The defined-or operator // is used to send items with no file assigned to file_unknown.
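With the sample data above, the result would be along these lines (a sketch of the expected outcome, not verified output):
file_2        contains item_3 followed by item_22 (the second open uses ">>")
file_5        contains item_1
file_7        contains item_6
file_unknown  collects item_2 and every other item with no file assigned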

bash: transform scaffold fasta

I have a fasta file with the following sequences:
>NZ_OCNF01123018.1
TACAAATACAACAAATACAAGTACACCAAGTACAAATACAAGTATCCCAAGTACAAATACAAGTA
TCCCAAGTACAAATACAAGTATTCCAAGTACAAATACAAAACCTGTTGAGCAACCTAAACCTGTTGAAC
AGCCCAAACCTGTTGAACAGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAACCTTTATCCGCACTTA
CGAGCAAATACACCAATACCGCTTTATCGGCACAGTCTGCCCAAATTGACGGATGCACCATGTTACCCAACAC
ATCAATCAACGTTTGTGGGATCACCTGAAAAAGGGCGCGGTTTGTGGTTGATG
>NZ_OCNF01123018.2
AATTGTCGTGTAAAGCCACACCAAACCCCATTATAGCCCCAAAAACACCAAAAAGGCTGCCTGAACCACATTTCAGACAG
And I want to split every sequence in the file that contains a run of multiple Ns at the site where the run occurs, making two sequences out of it.
Expected solution:
>NZ_OCNF01123018.1
TACAAATACAACAAATACAAGTACACCAAGTACAAATACAAGTATCCCAAGTACAAATACAAGTA
TCCCAAGTACAAATACAAGTATTCCAAGTACAAATACAAAACCTGTTGAGCAACCTAAACCTGTTGAAC
AGCCCAAACCTGTTGAACAGC
>contig1
AAACCTTTATCCGCACTTA
CGAGCAAATACACCAATACCGCTTTATCGGCACAGTCTGCCCAAATTGACGGATGCACCATGTTACCCAACAC
ATCAATCAACGTTTGTGGGATCACCTGAAAAAGGGCGCGGTTTGTGGTTGATG
>NZ_OCNF01123018.2
AATTGTCGTGTAAAGCCACACCAAACCCCATTATAGCCCCAAAAACACCAAAAAGGCTGCCTGAACCACATTTCAGACAG
my (inelegant) approach would be this:
perl -pe 's/[N]+/\*/g' $file | perl -pe 's/\*/\n>contig1\n/g'
Of course, that also replaces the Ns in the sequence headers and creates headers without a sequence. As a plus, it would be nice to number the new 'contigs' from 1 to x, in case there are multiple sequences with Ns.
What do you suggest?
I'd suggest using split instead of trying to get a regex just right, and a script instead of a brittle and crammed "one"-liner.
use warnings;
use strict;
use feature 'say';

my $file = shift @ARGV;
die "Usage: $0 filename\n" if !$file;  # also check submitted $file

my $content = do {  # or: my $content = Path::Tiny::path($file)->slurp;
    local $/;
    open my $fh, '<', $file or die "Can't open $file: $!";
    <$fh>;
};

my @f = grep { /\S/ } split /(?<!>)NN+/, $content;

say shift @f;

my $cnt;
for (@f) {
    say "\n>contig", (++$cnt), ":\n$_";
}
This slurps the file into $content, since NN+ can span multiple lines; the Path::Tiny module can make that cleaner. The first element of the resulting array needs no >contig header, so it is shifted off and printed as-is.
The negative lookbehind (?<!...) makes split's separator pattern match NN+ only when it is not preceded by >, thus protecting (excluding) header lines that may start with that. If headers may contain consecutive Ns that are not directly after >, then you need to refine this.
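A quick way to see the lookbehind at work (hypothetical one-off strings, for illustration only):
use feature 'say';
say for split /(?<!>)NN+/, 'ACGTNNNNACGT';   # prints ACGT twice: the N run splits the string
say for split /(?<!>)NN+/, '>NNZ_header';    # prints >NNZ_header whole: NN right after > is protected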
I expanded your Perl one-liner a bit:
cat file.fasta | \
perl -pe 's/\n//g unless /^>/; s/>/\n>/g;' | \
perl -pe 's/N+(?{$n++})/\n>contig${n}\n/g unless /^>/'
The first part removes the newlines between bases; the second part replaces each continuous run of 'N's.

Extracting the first two characters from a file in perl into another file

I'm having a little bit of trouble with my code below -- I'm trying to figure out how to open up all these text files (.csv files that end in DIS that all have one line in them) and get the first two characters (these are all numbers) from them and print them into another file of the same name, with a ".number" suffix. Some of these .DIS files don't have anything in them, in which case I want to print "0".
Lastly, I would like to go through each original .DIS file and delete the first 3 characters -- I did this through bash.
my @DIS = <*.DIS>;

foreach my $file (@DIS) {
    my $name = $file;
    my $output = "$name.number";
    open(INHANDLE, "< $file") || die("Could not open file");
    while(<INHANDLE>) {
        open(OUT_FILE, ">$output") || die;
        my $line = $_;
        chomp($line);
        my $string = $line;
        if ($string eq "") {
            print "0";
        } else {
            print substr($string,0,2);
        }
    }
    system("sed -i 's/\(.\{3\}\)//' $file");
}
When I run this code, I get a list of numbers that are all concatenated together, and empty .DIS.number files. I'm rather new to Perl, so any help would be appreciated!
When I run this code, I get a list of numbers that are all concatenated together, and empty .DIS.number files.
This is because of this line.
print substr($string,0,2);
print defaults to printing to STDOUT (i.e. the screen). You need to give it the filehandle to print to.
print OUT_FILE substr($string,0,2);
They're being concatenated because print just prints what you tell it to; it won't put newlines in for you (there are some global variables which can change this, but don't mess with them). You have to add the newline yourself.
print OUT_FILE substr($string,0,2), "\n";
As a final note, when working with files in Perl I would suggest using lexical filehandles, Path::Tiny, and autodie. They will avoid a great number of classic problems working with files in Perl.
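A minimal sketch of the question's task in that style (assuming, as the question states, one line per .DIS file; autodie makes failed opens fatal automatically):
use strict;
use warnings;
use autodie;    # open/close failures die with a useful message

for my $file (glob '*.DIS') {
    open my $in,  '<', $file;
    open my $out, '>', "$file.number";
    my $line = <$in>;
    chomp $line if defined $line;
    # first two characters, or 0 for an empty file
    my $prefix = (defined $line && $line ne '') ? substr($line, 0, 2) : '0';
    print $out "$prefix\n";
    close $out;
    close $in;
}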
I suggest you do it like this.
Each *.DIS file is opened and its contents read into $text. Then a regex substitution is used to remove the first three characters from the string, capturing the first two in $1.
If the substitution succeeded, the contents of $1 are written to the number file; otherwise the original file is empty (or shorter than two characters) and a zero is written instead. The remaining contents of $text are then written back to the *.DIS file.
use strict;
use warnings;
use v5.10.1;
use autodie;

for my $dis_file ( glob '*.DIS' ) {

    my $text = do {
        open my $fh, '<', $dis_file;
        <$fh>;
    };

    my $num_file = "$dis_file.number";

    open my $dis_fh, '>', $dis_file;
    open my $num_fh, '>', $num_file;

    if ( defined $text and $text =~ s/^(..).?// ) {
        print $num_fh "$1\n";
        print $dis_fh $text;
    }
    else {
        print $num_fh "0\n";
        print $dis_fh "-\n";
    }
}
This awk script extracts the first two chars of each file into its own file. Empty files are expected to contain one empty line, based on the spec.
awk 'FNR==1{pre=substr($0,1,2);pre=length(pre)==2?pre:0; print pre > FILENAME".number"}' *.DIS
This will remove the first 3 chars
cut -c 4-
A bash for loop will be better for doing both, though we'll need to modify the awk script a little bit:
for f in *.DIS; do
    awk 'NR==1{pre=substr($0,1,2);$0=length(pre)==2?pre:0; print}' $f > $f.number;
    cut -c 4- $f > $f.cut;
done
Explanation: loop through all files in *.DIS; for the first line of each file, try to get the first two chars (1,2) of the line ($0) and assign them to pre. If the length of pre is not two (either the line is empty or has only 1 char), set the line to 0, otherwise use pre; then print the line. The output file name is the input file name with a .number suffix appended. The $0 assignment is a trick to save a couple of keystrokes, since print without arguments prints $0; otherwise you can provide the argument.
Ideally you should quote "$f", since file names may contain spaces...

Delete first line in file if it matches a pattern

I wonder if there is an efficient way to delete the first line in a file if it matches a specified pattern. For example, I have a file with data of the following form:
Date,Open,High,Low,Close,Volume,Adj.Volume
2012-01-27,42.38,42.95,42.27,42.68,2428000,42.68
2012-01-26,44.27,44.85,42.48,42.66,5785700,42.66
.
.
.
I want to delete the first line only if it contains text (as shown in the example's first line), and leave it unchanged if it contains only numbers (as in the rest of the lines). This task is quite easy, and I've accomplished it by applying the following piece of code, which writes each line to $newFile as long as it does not match the Date pattern:
while( <$origFile> )
{
    chomp($_);
    print $newFile $_ unless ($_ =~ m/Date/g);
}
So, as I mentioned, that gets the job done. However, it seems a great waste of resources to read every line of a whole file when it is known that the text can appear only in the first line.
Is there any way to accomplish this task more efficiently?
NOTE: I already found an almost identical question here, but since I want my code to work on Linux and Windows alike, using sed will not help me.
Thanks in advance!
$. can be used to determine if we are processing the first line of the file.
perl -i.bak -ne'print if $. != 1 || !/^Date/;' file
However, it seems a great waste of resources to read every line of a whole file
It's impossible to delete from anywhere but the end of a file. To delete from the start or middle, everything that follows in the file needs to be shifted, which means it must be both read and written.
You can only avoid work if the first line doesn't match (by doing nothing at all). If you need to remove the line, you must copy the whole file.
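A sketch of that shortcut: peek at the first line and bail out without touching the file when it is already data (the file name here is hypothetical):
use strict;
use warnings;

my $file = 'mydatafile';
open my $in, '<', $file or die "Can't read $file: $!";
my $first = <$in>;
exit if !defined $first || $first !~ /^Date/;   # no header line, nothing to do

# the first line matched: copy the remainder out and swap the files
open my $out, '>', "$file.new" or die "Can't write $file.new: $!";
print $out $_ while <$in>;
close $out;
close $in;
rename "$file.new", $file or die "Can't rename: $!";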
The Tie::File module is ideal for this. It is very efficient as it does block IO instead of reading a line at a time, and it makes the program very simple to write.
use strict;
use warnings;
use Tie::File;

tie my @data, 'Tie::File', 'mydatafile' or die $!;
shift @data if $data[0] =~ /Date/;
untie @data;
Only do the test on the first line, then just run through the rest of the file without checking:
if (defined( $_ = <$origFile> )) {
    if ( ! m/Date/o ) { print $newFile $_ }
    my $data;
    for (;;) {
        my $readRes = read($origFile, $data, 0x10000);
        if (!defined $readRes) { die "Can't read: $!" }
        if ($readRes == 0) { last }
        print $newFile $data;
    }
}

append a text on the top of the file

I want to add text at the top of my data.txt file, but this code adds the text at the end of the file. How can I modify this code to write the text at the top of my data.txt file? Thanks in advance for any assistance.
open (MYFILE, '>>data.txt');
print MYFILE "Title\n";
close (MYFILE)
perl -pi -e 'print "Title\n" if $. == 1' data.text
Your open syntax is deprecated (thanks, Seth); prefer the three-argument form:
open(MYFILE, '>>', "data.txt") or die $!;
You will have to make a full pass through the file and write out the desired data before the existing file contents:
open my $in, '<', $file or die "Can't read old file: $!";
open my $out, '>', "$file.new" or die "Can't write new file: $!";

print $out "# Add this line to the top\n"; # <--- HERE'S THE MAGIC

while( <$in> ) {
    print $out $_;
}

close $out;
close $in;

unlink($file);
rename("$file.new", $file);
(gratuitously stolen from the Perl FAQ, then modified)
This will process the file line-by-line so that on large files you don't chew up a ton of memory. But, it's not exactly fast.
Hope that helps.
There is a much simpler one-liner to prepend a block of text to every file. Let's say you have a set of files named body1, body2, body3, etc, to which you want to prepend a block of text contained in a file called header:
cat header | perl -0 -i -pe 'BEGIN {$h = <STDIN>}; print $h' body*
Appending to the top is normally called prepending.
open(M, "<", "data.txt");
@m = <M>;
close(M);

open(M, ">", "data.txt");
print M "foo\n";
print M @m;
close(M);
Alternately open data.txt- for writing and then move data.txt- to data.txt after the close, which has the benefit of being atomic so interruptions cannot leave the data.txt file truncated.
See the Perl FAQ Entry on this topic
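A sketch of that atomic variant, using the data.txt- temporary name suggested above:
open my $in,  '<', 'data.txt'  or die $!;
open my $out, '>', 'data.txt-' or die $!;
print $out "Title\n";            # the new first line
print $out $_ while <$in>;
close $out;
close $in;
rename 'data.txt-', 'data.txt' or die $!;   # atomic replace on POSIX systems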
perl -ni -e 'print "Title\n" if $. == 1; print' filename
This prints the title once, then passes every line of the file through.
