How to improve performance for this code?

This code runs on a file of at least 200 million lines, and it takes a lot of time.
I would like to know if I can improve the runtime of this loop.
my @bin_list; # list of 0's and 1's
my $i = 0;
while (my $line = $input_io->getline) {
    if ($bin_list[$i]) {
        $line =~ s/^.{3}/XXX/;
    } else {
        $line =~ s/^.{3}/YYY/;
    }
    $output_io->appendln($line);
    $i++;
}

A regex solution may be overkill here. How about replacing the if/else blocks with:
substr($line, 0, 3, $bin_list[$i] ? 'XXX' : 'YYY');

The smallest change is probably to buffer between appendln calls:
my @bin_list; # list of 0's and 1's
my $i = 0;
my $buffer = '';
while (my $line = $input_io->getline) {
    if ($bin_list[$i]) {
        $line =~ s/^.{3}/XXX/;
    } else {
        $line =~ s/^.{3}/YYY/;
    }
    $buffer .= $line;
    if ( $i % 1000 == 0 ) {
        $output_io->appendln($buffer);
        $buffer = '';
    }
    $i++;
}
if ( $buffer ne '' ) {
    $output_io->appendln($buffer);
}
Are you using IO::All?
I couldn't find anything else with appendln...
Replacing this:
my $input_io = io 'tmp.this';
my $output_io = io 'tmp.out';
while (my $line = $input_io->getline) {
    $output_io->appendln($line);
}
With this:
open( my $ifh, '<', 'tmp.this' ) or die "Cannot open input: $!";
open( my $ofh, '>>', 'tmp.out' ) or die "Cannot open output: $!";
while (my $line = <$ifh>) {
    print $ofh $line;
}
close $ifh;
close $ofh;
Is quite a bit faster (1 sec vs 23 in my test case).
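Putting the suggestions together (plain filehandles instead of IO::All, 4-arg substr instead of a regex, and buffered writes), an untested sketch might look like the following; the file names and the 1000-line flush interval are placeholders, not measured values:
use strict;
use warnings;

my @bin_list;    # list of 0's and 1's, filled elsewhere
open( my $ifh, '<', 'tmp.this' ) or die "Cannot open input: $!";
open( my $ofh, '>>', 'tmp.out' ) or die "Cannot open output: $!";

my $i = 0;
my $buffer = '';
while ( my $line = <$ifh> ) {
    # replace the first 3 characters in place, no regex engine involved
    substr( $line, 0, 3, $bin_list[$i] ? 'XXX' : 'YYY' );
    $buffer .= $line;
    if ( ++$i % 1000 == 0 ) {    # flush every 1000 lines
        print $ofh $buffer;
        $buffer = '';
    }
}
print $ofh $buffer if $buffer ne '';
close $ifh;
close $ofh;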

Related

Add automatic header comments for all source files

I have a standardized way of writing comments before and after a function.
For example, before declaring any function I write:
!---------------------------
! NAME_OF_FUNC (no)
!---------------------------
where no is the nth function in a given file containing multiple functions.
I know that, following Fortran convention, a function begins with either subroutine NAME_OF_SUB or function NAME_OF_FUNC. Hence, my end result would be
!---------------------------
! NAME_OF_FUNC (no)
!---------------------------
function NAME_OF_FUNC(...)
end function
!---------------------------
Can somebody show an example of how to write a bash script, or code in any other scripting language, that can go through all my source files and apply the standard convention I just showed?
Here is an example in Perl. It does not take a backup before overwriting (I would recommend that you try to improve this script and add backup functionality). It also does not add the end-of-subroutine marker, but it would be easy to add that functionality; please try. It also assumes that you want to modify all *.f95 files in the current directory and all its subdirectories:
use feature qw(say state);
use strict;
use warnings;
use File::Find::Rule;

my @files = File::Find::Rule->new->name('*.f95')->in('.');
for my $fn (@files) {
    open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
    my $txt = do {local $/; <$fh>};
    close $fh;
    process_txt( \$txt );
    save_txt( $fn, \$txt );
}

sub save_txt {
    my ( $fn, $txt ) = @_;
    open ( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";
    print $fh $$txt;
    close $fh;
}

sub process_txt {
    my ( $txt ) = @_;
    my $i = 1;
    $$txt =~ s/^(.*(?i:function|subroutine)\s+)([a-zA-Z_]+)(\s*\(.*$)/
        do_substitution($1, $2, $3, $i++)/egmx;
}

sub do_substitution {
    my ( $head, $name, $end, $i ) = @_;
    my $line = $head . $name . $end;
    $line =~ s/\s+$//;
    my $N = length $line;
    my $cline = '!' . '-' x $N;
    my $mline = '! ' . $name;
    my $snum = "($i)";
    my $M = (length $mline) + (length $snum);
    my $mspc = ($N > $M) ? (' ' x ($N-$M)) : ' ';
    $mline = $mline . $mspc . $snum;
    my $new_txt = join "\n", $cline, $mline, $cline, $line;
    return $new_txt;
}
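Since the answer leaves backups as an exercise, here is a minimal (untested) way to add them, using the core File::Copy module to copy each file to a .bak name (the suffix is my assumption) before save_txt overwrites it:
use File::Copy qw(copy);

sub save_txt {
    my ( $fn, $txt ) = @_;
    # assumption: a simple .bak copy next to the original is enough
    copy( $fn, "$fn.bak" ) or die "Could not back up '$fn': $!";
    open ( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";
    print $fh $$txt;
    close $fh;
}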

Error executing SQL using Perl

I am trying to execute some open source code which finds the list of tables involved in an SQL query.
I am working on Retrieve table names from Oracle queries.
I understood the expressions and commands to some extent and tried it.
Details of my execution:
GetTable.pl file
same as in the link
test.sql file
I didn't use the one in the link. Instead I had only a single SQL statement for testing.
SELECT emp_name FROM load_tables.temp;
Executed in Strawberry Perl
I tried the following
$ perl GetTable.pl
Usage : GetTable <sql query file>
$ perl test.sql
Can't locate object method "FROM" via package "load_tables" (perhaps you forgot to load "load_tables"?) at test.sql line 1
Can someone help me execute it? I'm not sure if there is a problem with the code, as I could see two people have executed it successfully.
Perl code
#!/usr/bin/perl
use warnings;

# Function which gets the table names and formats and prints them.
sub printTable {
    my $tab = shift;
    $tab =~ s/,\s+/,/g;
    $tab =~ s/\s+,/,/g;
    my @out = split( /,/, $tab );
    foreach ( @out ) {
        $_ =~ s/ .*//;
        print $opr, $_, "\n";
    }
}

# Function which gets the individual queries and separates the table
# names from the queries. Sub-queries, co-related queries, etc.
# will also be handled.
sub process {
    local $opr;
    my $line = shift;
    $line =~ s/\n/ /g;
    if ( $line =~ m/^\s*(select|delete)/i ) {
        if ( $line =~ m/^\s*select/i ) {
            $opr = "SELECT: ";
        }
        else {
            $opr = "DELETE: ";
        }
        if ( $line =~ m/from.*where/i ) {
            while ( $line =~ m/from\s+(.*?)where/ig ) {
                &printTable( $1 );
            }
        }
        elsif ( $line =~ m/from.*;/i ) {
            while ( $line =~ m/from\s+(.*);/ig ) {
                &printTable( $1 );
            }
        }
    }
    elsif ( $line =~ m/^\s*update\s+(\w+)\s+/i ) {
        $opr = "UPDATE: ";
        &printTable( $1 );
    }
    elsif ( $line =~ m/^\s*insert\s+into\s+(\w+)\s+/i ) {
        $opr = "INSERT: ";
        &printTable( $1 );
    }
}

# The main part, which reads the file, collects each
# query into a variable, and sends it to the process function.
if ( @ARGV != 1 ) {
    print "Usage: GetTable <sql query file>\n";
    exit 1;
}
open QFILE, $ARGV[0] or die "File $ARGV[0]: $! \n";
my $flag = 0;
my $query = "";
my $conds = "select|insert|update|delete";
while ( <QFILE> ) {
    next if ( /^$/ );
    if ( $flag == 1 ) {
        $query .= $_;
        if ( /;\s*$/ ) {
            $flag = 0;
            &process( $query );
        }
    }
    elsif ( /^\s*($conds).*;\s*/i ) {
        &process( $_ );
    }
    elsif ( /^\s*($conds)/i ) {
        $flag = 1;
        $query = $_;
    }
}
close QFILE;
Two important skills to learn as a programmer are a) accuracy in following instructions and b) reading the error message carefully.
You started by running GetTable.pl. But that program requires a parameter (the name of an SQL file to analyse) and the error message tried to tell you that.
I don't know why, but instead of doing what the error message told you to do (which would have been to run perl GetTable.pl test.sql) you decided to ask Perl to run your SQL file.
The second error message you got was the Perl compiler trying to make sense of the SQL that you asked it to run. But the Perl compiler doesn't understand SQL, it understands Perl. So it's no surprise that it got confused.
To fix it, do what your first error message suggested and run the command
$ perl GetTable.pl test.sql
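Run that way, the script should print the table name it extracts from the FROM clause of your test statement, something like:
SELECT: load_tables.temp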

Keep shared entries among many files

I have hundreds of files, each with a different number of entries (>xxxx), and I want to keep only the entries shared among all files, separately per file. I'm not sure what the best method to do this is, maybe Perl! I used bash's sort and uniq, but I didn't get the correct answer. The IDs start with > followed by 4 characters, consistent across all files.
1.fa
>abcd
CTGAATGCC
2.fa
>abcd
AAATGCGCG
>efgh
CGTAC
3.fa
>abcd
ATGCAATA
>efgh
TAACGTAA
>ijkl
TGCAA
The final results for this example would be:
1.fa
>abcd
CTGAATGCC
2.fa
>abcd
AAATGCGCG
3.fa
>abcd
ATGCAATA
This Perl program will do as you ask. It uses Perl's built-in edit-in-place functionality and renames the original files to 1.fa.bak etc. It shouldn't have a problem with blank lines in your data as long as the sequence is always on one line immediately following the ID.
use strict;
use warnings 'all';

my @files = glob '*.fa';
printf "Processing %d file%s\n", scalar @files, @files == 1 ? "" : "s";
exit if @files < 2;

my %ids;
{
    local @ARGV = @files;
    while ( <> ) {
        ++$ids{$1} if /^>(\S+)/;
    }
}

# remove keys that aren't in all files
delete @ids{ grep { $ids{$_} < @files } keys %ids };

my $n = keys %ids;
printf "%d ID%s common to all files\n", $n, $n == 1 ? '' : 's';
exit unless $n;

{
    local @ARGV = @files;
    local $^I = '.bak';
    while ( <> ) {
        next unless /^>(\S+)/ and $ids{$1};
        print;
        print scalar <>;
    }
}
Here is a Perl solution that may help you:
use feature qw(say);
use strict;
use warnings;

my $file_dir = 'files';
chdir $file_dir;
my @files = <*.fa>;
my $num_files = scalar @files;

# count how many files each ID occurs in
my %ids;
for my $file (@files) {
    open ( my $fh, '<', $file) or die "Could not open file '$file': $!";
    while (my $id = <$fh>) {
        chomp $id;
        chomp (my $sequence = <$fh>);
        $ids{$id}++;
    }
    close $fh;
}

# write out only the IDs that occur in every file
for my $file (@files) {
    open ( my $fh, '<', $file) or die "Could not open file '$file': $!";
    my $new_name = $file . '.new';
    open ( my $fh_write, '>', $new_name ) or die "Could not open file '$new_name': $!";
    while (my $id = <$fh>) {
        chomp $id;
        chomp (my $sequence = <$fh>);
        if ( $ids{$id} == $num_files ) {
            say $fh_write $id;
            say $fh_write $sequence;
        }
    }
    close $fh_write;
    close $fh;
}
It assumes that all the .fa files are located in the directory named $file_dir, and it writes the new sequences to new files in the same directory. The new file names get the .new extension.

Perl script hangs for no reason

So I have this small script which checks two log files for a specific line and compares the lines.
The script is used on several different Windows Bamboo agents, but on one it just hangs and doesn't exit. Since the script is used in Bamboo, the whole job hangs when this script doesn't exit.
When I check the computer via remote access and kill the script the job continues until it reaches the script again.
This is the script, which is started by another script.
#! /usr/bin/perl
my $naluresult = 2;
my $hevcresult = 2;
my $hevcfailed = 0;
use strict;
use warnings;

#---------------------------------------------
#check for $ARGV[0] and $ARGV[1]
open( my $nalulog, "<", $ARGV[1] )
    or die "cannot open File:$!\n\n";
while (<$nalulog>) {
    chomp;
    $_ =~ s/\s+//g;
    if ( $_ =~ m/MD5:OK/ ) {
        $naluresult = 1;
    } else {
        if ( $_ =~ m/MD5:MISSING/ ) {
            $naluresult = 0;
        }
    }
}
close $nalulog;

#---------------------------------------------
open( my $hevclog, "<", $ARGV[0] )
    or die "cannot open File:$!\n\n";
while (<$hevclog>) {
    chomp;
    $_ =~ s/\s+//g;
    if ( $_ =~ m/MD5check:OK/ ) {
        $hevcresult = 1;
        last;
    } else {
        if ( $_ =~ m/MD5check:FAILED/ ) { $hevcfailed = 1; }
    }
    if ( $hevcfailed == 1 ) {
        #do stuff
    }
}
close $hevclog;

#---------------------------------------------
if ( $hevcresult == 2 ) {
    print("Missing MD5 status in HEVC Output");
    exit(-1);
} elsif ( $naluresult == 2 ) {
    print("Missing MD5 status in NALU Output");
    exit(-2);
} else {
    if ( $naluresult == $hevcresult ) { exit(0); }
    else {
        #different if-statements to print() to log
        exit(1);
    }
}
#---------------------EOF---------------------
If your files are just normal disk files that aren't being simultaneously written to by other processes, or locked, or anything like that, then there is nothing in the code you have here that should hang. If the files are both reasonable sizes, the code you have here should read through the files and finish.
However, if one of the files is locked, or is immensely large, or if you have other code that can get stuck in an infinite loop, that would explain why your program is hanging.
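If you want the job to fail fast instead of hanging while you investigate, one rough (untested) diagnostic is to wrap the file reading in an alarm-based watchdog so the script dies with a message after a timeout. Note that alarm support on Windows is limited, and the 60-second value is an arbitrary assumption:
use strict;
use warnings;

# die instead of hanging if a read takes longer than expected
local $SIG{ALRM} = sub { die "Timed out reading log files\n" };
alarm 60;    # arbitrary timeout in seconds

open( my $nalulog, "<", $ARGV[1] )
    or die "cannot open File:$!\n\n";
while (<$nalulog>) {
    # ... existing MD5 checks ...
}
close $nalulog;

alarm 0;     # cancel the watchdog once the reads complete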

How to compare the content of multiple txt files in bash shell and delete the duplicates

I am trying to achieve this on Mac OS; I tried using fdupes for something similar, but it didn't work. Here is what I am trying to achieve:
There are 100 files in directory 'alpha'
Pick one file A and compare it with each remaining file in the directory 'alpha'
If content of file A matches any file (duplicate), delete the duplicate file
Move to file B, compare it with the remaining files, and do the same (check for duplicates)
Repeat until all files are checked for duplicates. The remaining files should be unique
Update
I modified something similar I found here, but I have to run it multiple times to take out all the duplicates. It is not detecting every duplicate in a single run, so I'm not sure if it is working correctly.
use Digest::MD5;

%check = ();
while (<*>) {
    -d and next;
    $fname = "$_";
    print "checking .. $fname\n";
    $md5 = getmd5($fname) . "\n";
    if ( !defined( $check{$md5} ) ) {
        $check{$md5} = "$fname";
    }
    else {
        print "Found duplicate files: $fname and $check{$md5}\n";
        print "Deleting duplicate $check{$md5}\n";
        unlink $check{$md5};
    }
}

sub getmd5 {
    my $file = "$_";
    open( FH, "<", $file ) or die "Cannot open file: $!\n";
    binmode(FH);
    my $md5 = Digest::MD5->new;
    $md5->addfile(FH);
    close(FH);
    return $md5->hexdigest;
}
You should limit the number of times that you have to read each file's contents:
1. Inventory the files using Path::Class or some similar method, building a hash that relates file size and MD5 digest to a list of file names.
2. Compare likely duplicates only: those that match in both file size and digest.
The following is untested:
use strict;
use warnings;
use Path::Class;
use Digest::MD5;

my $dir = dir('.');
my %files_per_digest;

# Inventory directory
while ( my $file = $dir->next ) {
    next if $file->is_dir;    # skip subdirectories and '.'/'..'
    my $size = $file->stat->size;
    my $digest = do {
        my $md5 = Digest::MD5->new;
        $md5->addfile( $file->openr );
        $md5->hexdigest;
    };
    push @{ $files_per_digest{"$size - $digest"} }, $file;
}

# Compare likely duplicates only
for my $files ( grep { @$_ > 1 } values %files_per_digest ) {
    # Sort by alpha
    @$files = sort @$files;
    print "Comparing: @$files\n";
    for my $i ( reverse 0 .. $#$files ) {
        for my $j ( 0 .. $i - 1 ) {
            my $fh1 = $files->[$i]->openr;
            my $fh2 = $files->[$j]->openr;
            my $diff = 0;
            while ( !eof($fh1) && !eof($fh2) ) {
                $diff = 1, last if scalar(<$fh1>) ne scalar(<$fh2>);
            }
            # identical only if no differing line was seen and both files ended
            if ( !$diff && eof($fh1) && eof($fh2) ) {
                print "  $files->[$i] ($i) is duplicate of $files->[$j] ($j)\n";
                $files->[$i]->remove();
                splice @$files, $i, 1;
                last;    # this index is gone; move on to the next $i
            }
        }
    }
}
I've used rdfind in the past with very good success. It's very accurate, fast, and seems to run leaner than fdupes. According to RDFind's web site (http://rdfind.pauldreik.se/), it can be installed using MacPorts.
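For the directory layout described in the question, the invocation would be something along these lines (from memory, so double-check against the man page; the -deleteduplicates option tells rdfind to remove the duplicates it finds rather than just report them):
rdfind -deleteduplicates true alpha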
