How to delete all the lines that match specific condition - bash

I have a number of PDB files and I want to keep only the lines that start with FORMUL. Among those, a line should not be printed if it has C followed by a number larger than 2 (C3, C4, C5, C6 ... C100, etc.). The second condition is that if the sum of C, O and N within a line is >= 6, the line should be deleted.
So overall, delete the lines in which C is followed by a number greater than 2 and the sum of C+O+N is >= 6.
FORMUL 3 HOH *207(H2 O) (print it)
FORMUL 2 SF4 FE4 S4 (print it)
FORMUL 3 NIC C5 H7 N O7 (don't print, there is C5 and sum is more then 6)
FORMUL 4 HOH *321(H2 O) (print it)
FORMUL 3 HEM 2(C34 H32 FE N4 O4) (don't print, there is C34)
I have tried to do it in Perl, but the lines are so diverse that I'm not sure it is possible.
Overall, these conditions should be applied together, meaning that all lines in which C>2 and sum>=6 should be deleted.
C1 O5 N3 should be deleted; C3 N1 O1 should not be deleted although C is 3.
In Perl I don't know how to combine these two conditions. Here I wrote the opposite: instead of deleting lines, print them if the two conditions are not met.
#!/usr/bin/perl
use strict;
use warnings;
my @lines;
my $file;
my $line;
open ($file, '<', '5PCZ.pdb') or die $!;
while (my $line = <$file>)
{
    if ($line =~ m/^FORMUL/)
    {
        push (@lines, $line);
    }
}
close $file;
#print "@lines\n";
foreach $line(@lines)
{
    if ($line eq /"C"(?=([0-2]))/ )
    {
    elsif ($line eq "Sum of O,N & C is lt 6")
        print @lines
    }
}

As you've seen, it's probably easier to write this as a filter that prints the lines that you want to keep. I've also written this following the Unix Filter Model (reads from STDIN, writes to STDOUT) as that makes the program far more flexible (and, interestingly, easier to write!)
Assuming that you're running the program on Linux (or similar) and that your code is in an executable file called my_filter (I recommend a more descriptive name!) then you would call it like this:
$ my_filter < 5PCZ.pdb > 5PCZ.pdb.new
The code would look like this:
#!/usr/bin/perl
use strict;
use warnings;
while (<>) { # read from STDIN a line at a time
    # Split data on whitespace, but only into four columns
    my @cols = split /\s+/, $_, 4;
    next unless $cols[0] eq 'FORMUL';
    # Now extract the letter stuff into a hash for easy access.
    # We extract letters from the final column in the record.
    my %letters = $cols[-1] =~ m/([A-Z])(\d+)/g;
    # Give the values we're interested in a default of 0
    $letters{$_} //= 0 for (qw[C O N]);
    next if $letters{C} > 2
        and $letters{C} + $letters{O} + $letters{N} >= 6;
    # I think we can then print the line
    print;
}
This seems to give the correct output for your sample data. And I hope the comments make it obvious how to tweak the conditions.
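For comparison, here is a minimal Python sketch of the same filter logic. It only mirrors the Perl code above; the function name keep_formul_line is made up for illustration, and like the Perl version it counts only elements that are followed by a number, so a bare N or O counts as 0.

```python
import re

def keep_formul_line(line):
    """Return True if a FORMUL line should be kept (printed)."""
    # Split on whitespace into at most four columns, like the Perl split.
    cols = line.split(None, 3)
    if not cols or cols[0] != "FORMUL":
        return False
    # Map each letter that is followed by digits to its count.
    counts = {el: int(num) for el, num in re.findall(r"([A-Z])(\d+)", cols[-1])}
    c = counts.get("C", 0)
    o = counts.get("O", 0)
    n = counts.get("N", 0)
    # Drop the line only when both conditions hold.
    return not (c > 2 and c + o + n >= 6)

sample = [
    "FORMUL 3 HOH *207(H2 O)",
    "FORMUL 2 SF4 FE4 S4",
    "FORMUL 3 NIC C5 H7 N O7",
    "FORMUL 4 HOH *321(H2 O)",
    "FORMUL 3 HEM 2(C34 H32 FE N4 O4)",
]
for line in sample:
    if keep_formul_line(line):
        print(line)
```

Changing the `and` in the last condition to `or` would drop a line as soon as either condition holds, which is the other reading of the question.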

Extended Awk solution:
awk -F'[[:space:]][[:space:]]+' \
'/^FORMUL/{
    # \< (start of word) is a GNU awk extension
    if ($4 !~ /\<C/) print;
    else {
        match($4, /\<C[0-9]+/);
        # "+ 0" forces a numeric value; otherwise "c > 2" compares strings
        c=substr($4, RSTART+1, RLENGTH-1) + 0;
        if (c > 2) next;
        else {
            match($4, /\<O[0-9]+/);
            o=substr($4, RSTART+1, RLENGTH-1) + 0;
            match($4, /\<N[0-9]+/);
            n=substr($4, RSTART+1, RLENGTH-1) + 0;
            if (c+o+n < 6) print
        }
    }
}' 5PCZ.pdb
The output:
FORMUL 3 HOH *207(H2 O)
FORMUL 2 SF4 FE4 S4
FORMUL 4 HOH *321(H2 O)

Related

Perl sorting Alpha characters in a special way

I know this question may have been asked a million times but I am stumped. I have an array that I am trying to sort. The results I want to get are
A
B
Z
AA
BB
The sort routines that are available don't sort it this way. I am not sure if it can be done. Here is my Perl script and the sorting that I am doing. What am I missing?
# header
use warnings;
use strict;
use Sort::Versions;
use Sort::Naturally 'nsort';
print "Perl Starting ... \n\n";
my @testArray = ("Z", "A", "AA", "B", "AB");
#sort1
my @sortedArray1 = sort @testArray;
print "\nMethod1\n";
print join("\n",@sortedArray1),"\n";
my @sortedArray2 = nsort @testArray;
print "\nMethod2\n";
print join("\n",@sortedArray2),"\n";
my @sortedArray3 = sort { versioncmp($a,$b) } @testArray;
print "\nMethod3\n";
print join("\n",@sortedArray3),"\n";
print "\nPerl End ... \n\n";
1;
OUTPUT:
Perl Starting ...
Method1
A
AA
AB
B
Z
Method2
A
AA
AB
B
Z
Method3
A
AA
AB
B
Z
Perl End ...
I think what you want is to sort by length and then by ordinal. This is easily managed with:
my @sortedArray = sort {
    length $a <=> length $b ||
    $a cmp $b
} @testArray;
That reads exactly like the English: sort based on the length of a vs b, then by a compared to b.
my @sorted =
    sort {
        length($a) <=> length($b)
        ||
        $a cmp $b
    }
    @unsorted;
or
# Strings must have no characters above 255, and
# they must be shorter than 2^32 characters long.
my @sorted =
    map substr($_, 4),
    sort
    map pack("N/a*", $_),
    @unsorted;
or
use Sort::Key::Maker sort_by_length => sub { length($_), $_ }, qw( int str );
my @sorted = sort_by_length @unsorted;
The second is the most complicated, but it should be the fastest. The last one should be faster than the first.
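The same "length first, then alphabetical" ordering can be sketched in Python with a tuple sort key. This is only an illustration of the idea, not part of the Perl answers; the function name is made up.

```python
def sort_by_length_then_alpha(items):
    """Sort strings by length first, then by ordinary string comparison."""
    # Tuples compare element by element, so length decides first
    # and the string itself breaks ties.
    return sorted(items, key=lambda s: (len(s), s))

print(sort_by_length_then_alpha(["Z", "A", "AA", "B", "AB"]))
# prints ['A', 'B', 'Z', 'AA', 'AB']
```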

xyz coordinates manipulation using sed or awk

I have a huge number of plain text files containing Cartesian xyz coordinates of chemical structures. A sample could look like that:
B -1.38372433 0.56274955 2.22204795
B 0.01637488 1.69210489 1.81167819
B 0.29103422 -0.35499374 0.15388510
B 1.14485163 0.19631678 1.74992009
Fe -0.92583118 1.01775624 0.27450973
S -0.35374797 -1.05624221 1.74656393
C -1.87367299 1.66919492 -1.27526252
O -2.42173866 2.04584255 -2.17123145
H -2.54747585 0.75818308 2.22742141
H 0.62677160 -0.81072498 -0.88156036
H 0.38495881 2.74424131 2.19841880
H 2.25808628 0.09159351 1.37282254
In this case, each H atom is bonded to a B atom at a distance of 1.18 angstroms. What I'm supposed to do is to replace, in turn, each BH vertex with a P vertex.
Using bash, I'd like to act on all text files at once: take the coordinates of the first B atom encountered, use them as the center of a sphere, search within a radius of 1.18 angstroms for the bonded hydrogen atom, delete this H atom with its coordinates, and then change the B into a P atom.
The expected output for the above sample would be something like this:
P -1.38372433 0.56274955 2.22204795
B 0.01637488 1.69210489 1.81167819
B 0.29103422 -0.35499374 0.15388510
B 1.14485163 0.19631678 1.74992009
Fe -0.92583118 1.01775624 0.27450973
S -0.35374797 -1.05624221 1.74656393
C -1.87367299 1.66919492 -1.27526252
O -2.42173866 2.04584255 -2.17123145
H 0.62677160 -0.81072498 -0.88156036
H 0.38495881 2.74424131 2.19841880
H 2.25808628 0.09159351 1.37282254
I've done something similar a while back, but that was adding xyz coordinates of a H atom at a distance of 1.2 Angstroms from an existing B atom. what I used back then was:
for i in *.inp; do awk '/^B / { print; if (++count == 1) printf("%-10.8f %-14.8f %-14.8f %s\n", "H", $2+1.2, $3+1.2, $4+1.2); next } { print }' $i > temp/`basename $i`--H.inp; done
However, I'm still not successful in coming up with something similar to solve my current problem.
Any help is really appreciated
Thanks in advance
Perl solution:
#!/usr/bin/perl
use warnings;
use strict;
my @P;
my $deleted;
while (<>) {
    my @F = split;
    $F[0] = 'P', @P = @F if ! @P && 'B' eq $F[0];
    if ('H' eq $F[0] && ! $deleted) {
        die "No B found yet!\n" unless @P;
        my $close = grep abs($F[$_] - $P[$_]) <= 1.18001, 1, 2, 3;
        $deleted = 1, next if 3 == $close;
    }
    print "@F\n";
}
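Note that the Perl code above compares each coordinate difference separately, which tests a cube around the B atom rather than a sphere. A minimal Python sketch using the Euclidean distance could look like the following. The function name and the default cutoff of 1.19 are assumptions: in the sample, the distance between the first B and its H comes out at roughly 1.1801, marginally above 1.18, so a small tolerance is needed.

```python
import math

def replace_first_bh(lines, cutoff=1.19):
    """Rename the first B to P and drop the first H within `cutoff` of it."""
    atoms = [line.split() for line in lines if line.strip()]
    # Index of the first boron atom, or None if there is none.
    b_index = next((i for i, a in enumerate(atoms) if a[0] == "B"), None)
    if b_index is None:
        return [" ".join(a) for a in atoms]
    b_xyz = [float(v) for v in atoms[b_index][1:4]]
    # Index of the first hydrogen inside the sphere around that boron.
    h_index = next(
        (i for i, a in enumerate(atoms)
         if a[0] == "H"
         and math.dist(b_xyz, [float(v) for v in a[1:4]]) <= cutoff),
        None,
    )
    result = []
    for i, atom in enumerate(atoms):
        if i == h_index:
            continue  # drop the bonded hydrogen
        if i == b_index:
            atom = ["P"] + atom[1:]  # keep the original coordinate strings
        result.append(" ".join(atom))
    return result

sample = [
    "B -1.38372433 0.56274955 2.22204795",
    "B 0.01637488 1.69210489 1.81167819",
    "H -2.54747585 0.75818308 2.22742141",
    "H 0.38495881 2.74424131 2.19841880",
]
print("\n".join(replace_first_bh(sample)))
```

Because the atoms are read into a list first, the H atom does not have to appear after the B atom in the file.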

Perl sqrt, cube issue: 1 showing up after each line

I am having a tiny issue with a small perl script using arithmetic operators. After my cube root, and square root operators, a 1 shows up. I was testing this script on an openSUSE 42.1 VM.
I'm just not too certain what the 1 after each line is, I have tried looking it up, but am not too certain. I mainly script in bash, and ksh, so this perl syntax is a bit new to me.
Script:
#!/usr/bin/perl
# Provide a sum, cube of the sum, and square root of the sum of three set variables
# Set variables
$v1=10;
$v2=9;
$v3=8;
$val=$v1+$v2+$v3;
$cube=$val ** (1/3);
$square= sqrt($val);
print "Sum of 10, 9, 8: $val\n";
print
print "Cube of Sum: $cube\n";
print
print "Square of Sum: $square\n";
print
print "Thanks for using this script!"
Your lines just saying
print
are not statements in themselves as they are not terminated by a ;. Instead they are part of statements of the form
print print "text";
The inner print has an argument of "text" and prints that; the outer print has the argument print "text" and prints the value of that. When successful, print returns a value of 1 (perldoc only says it returns true, so don't rely on it being 1), so a 1 is printed.
If the point was to format your output nicely, you should explicitly print "\n".
As has been explained, half of your print calls are printing the return value of the following print statement because you are missing a semicolon at the end of the line to terminate the statement.
Also, print on its own will print the value of the default variable $_, not a newline as you expected. You need to write print "\n"; to achieve what you intend.
It's also very important to add use strict and use warnings 'all' to the top of every Perl program you write. You will also need to declare all of your variables using my.
#!/usr/bin/perl
use strict;
use warnings 'all';
# Provide a sum, cube of the sum, and square root of the sum of three set variables
# Set variables
my $v1 = 10;
my $v2 = 9;
my $v3 = 8;
my $val = $v1 + $v2 + $v3;
my $cube = $val**( 1 / 3 );
my $square = sqrt($val);
print "Sum of 10, 9, 8: $val\n";
print "\n";
print "Cube root of Sum: $cube\n";
print "\n";
print "Square root of Sum: $square\n";
print "\n";
print "Thanks for using this script!\n";
print "\n";
output
Sum of 10, 9, 8: 27
Cube root of Sum: 3
Square root of Sum: 5.19615242270663
Thanks for using this script!
It's also worth pointing out that there's a construct called a here document that will let you do this more neatly and clearly. If you change those print statements to just one, like this, then the intention is clear and the output is identical to that of the original code
print <<END;
Sum of 10, 9, 8: $val
Cube root of Sum: $cube
Square root of Sum: $square
Thanks for using this script!
END
As Henrik states in his answer, the lines with print and no ; are the problem.
An alternative way to get Perl to print a blank line between the main lines of output is to add an additional newline character, \n, at the end of each of the print lines and to remove the bare print lines. The code would become:
#!/usr/bin/perl
# Provide a sum, cube of the sum, and square root of the sum of three set variables
# Set variables
$v1=10;
$v2=9;
$v3=8;
$val=$v1+$v2+$v3;
$cube=$val ** (1/3);
$square= sqrt($val);
print "Sum of 10, 9, 8: $val\n\n";
print "Cube of Sum: $cube\n\n";
print "Square of Sum: $square\n\n";
print "Thanks for using this script!";
The output is:
Sum of 10, 9, 8: 27
Cube of Sum: 3
Square of Sum: 5.19615242270663
By the way, your equation for calculating the cube of the sum actually calculates the cube root. To calculate the cube of the sum you need:
$cube=$val ** (3);
Likewise, your equation to find the square of the sum is calculating the square root, not the square. To find the square of the sum you need to raise the sum to the power of 2.
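To make the difference between powers and roots concrete, here is a small Python sketch with the same sum of 27. The variable names are chosen for illustration only.

```python
import math

val = 10 + 9 + 8            # 27

cube = val ** 3             # cube of the sum
square = val ** 2           # square of the sum
cube_root = val ** (1 / 3)  # what the original script computed as "cube"
square_root = math.sqrt(val)  # what the original script computed as "square"

print(f"Sum: {val}")
print(f"Cube: {cube}, cube root: {cube_root}")
print(f"Square: {square}, square root: {square_root}")
```

The cube root is computed in floating point, so for other inputs it may differ from the exact value in the last digits.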

Execute an arbitrary number of nested loops in bash

I am trying to iterate through an n-dimensional space with a series of nested for-loops in bash.
VAR1="a b c d e f g h i"
VAR2="1 2 3 4 5 6 7 8 9"
VAR3="a1 b2 b3 b4 b5 b6"
for i1 in $VAR1; do
for i2 in $VAR2; do
for i3 in $VAR3; do
echo "$i1 $i2 $i3"
done
done
done
Now as I get more dimensions to iterate through, I realize it would be easier/better to be able to specify an arbitrary number of variables to loop through.
If I were using a more sophisticated programming language, I might use recursion to pass a list of lists to a function, pop one list off, iterate through it, recursively calling the function each time through the loop, passing the now reduced list of lists, and assembling the n-tuples as I go.
(I tried to pseudocode what that would look like, but it hurt my head thinking about recursion and constructing the lists.)
function iterate_through(var list_of_lists)
this_list=pop list_of_lists
var new_list = []
for i in this_list
new_list.push(i)
new_list.push(iterate_through(list_of_lists))
# return stuff
# i gave up about here, but recursion may not even be necessary
Anyone have a suggestion for how to accomplish iterating through an arbitrary number of vars in bash? Keeping in mind the goal is to iterate through the entire n-dimensional space, and that iteration is not necessarily part of the solution.
If parallel is acceptable, then one could simplify the nested for loop as
parallel -P1 echo {1} {2} {3} ::: $VAR1 ::: $VAR2 ::: $VAR3
In the general case, it could be perhaps feasible to first assemble this command and then execute it...
You can use recursion to compute the Cartesian product.
The following script does the job for an input vector of variable length:
#!/bin/bash
dim=("a b c d e f g h i" "1 2 3 4 5 6 7 8 9" "a1 b2 b3 b4 b5 b6")
function iterate {
    local index="$2"
    local i
    if [ "${index}" == "${#dim[@]}" ]; then
        # All dimensions are filled in: print the current tuple
        for (( i=0; i<index; i++ ))
        do
            echo -n "${items[$i]} "
        done
        echo ""
    else
        # Word splitting of the dimension string is intentional here
        for element in ${dim[${index}]}; do
            items["${index}"]="${element}"
            local it=$((index+1))
            iterate items[@] "$it"
        done
    fi
}
declare -a items=("")
iterate "" 0
The following gist will take as input arguments all your dimensions array (with space separated items) : https://gist.github.com/bertrandmartel/a16f68cf508ae2c07b59
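For comparison, where Python is available the same Cartesian product over an arbitrary number of dimensions can be sketched with itertools.product from the standard library. The helper name iterate_space is hypothetical; the input uses the same space-separated strings as the dim array above.

```python
from itertools import product

def iterate_space(dims):
    """Return an iterator over every tuple of the n-dimensional space."""
    # Each dimension is a space-separated string, as in the bash array.
    lists = [d.split() for d in dims]
    return product(*lists)

dims = ["a b c d e f g h i", "1 2 3 4 5 6 7 8 9", "a1 b2 b3 b4 b5 b6"]
for combo in iterate_space(dims):
    print(" ".join(combo))
```

Adding a dimension only means appending another string to dims; no extra loop is needed.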

Disable pattern caching

I am writing a program that will convert a postscript file to a simpler sequence of points that I can send to a plotter I am building. I am doing this by running a bit of header code that replaces the painting operations with operations that print points to stdout, which my main control program will then use:
/stroke { gsave
matrix defaultmatrix setmatrix
flattenpath
/str 20 string def
{(m ) print 2 copy exch str cvs print ( ) print =}
{(l ) print exch str cvs print ( ) print =}
{6 {pop} repeat (error) =} % should never happen...
{(l ) print exch str cvs print ( ) print =}
pathforall
grestore
stroke
} bind def
/fill {
gsave stroke grestore fill
} bind def
As a side note, I really wish postscript had a printf command, like 1 1 add (1+1=%d) printf.
In order for this to work with fonts, I disabled font caching by setting the cache limit to 0, with 0 setcachelimit. Otherwise, the postscript interpreter will not invoke the painting operations for subsequent uses of cached objects. I would have rather disabled font caching by redefining setcachedevice and setcachedevice2 but those operators also have to handle some character metric stuff, not just the caching.
User paths can also be cached, and I was able to disable this caching by redefining ucache via /ucache {} def and setting the cache limit to 0.
However, there does not seem to be a command for configuring the pattern cache parameters, and patterns do not need to explicitly request caching. Even if there were such a command, I would need to force the interpreter to invoke the painting operations for each and every pattern cell, even within the same fill operation. How can I disable pattern caching?
<</MaxPatternCache 0>> setsystemparams
Assuming that your interpreter doesn't have a password protecting the system parameters, and that it honours this system parameter.
See appendix C of the 3rd edition PLRM, especially section "C.3.3 Other Caches". You will need to consider Forms as well.
Here's an attempt at a printf implementation to match your syntax.
/formats <<
(d) { cvi 20 string cvs }
>> def
% val1 val2 .. valN (format-str) printf -
/printf {
0 1 index (%) { % ... (fmt) n (fmt) (%)
search { % ... (fmt) n (post)(%)(pre)
pop pop exch 1 add exch (%) % ... (fmt) n=n+1 (post) (%)
}{ % ... (fmt) n (rem)
pop exit
} ifelse
} loop % val1 val2 .. valN (fmt) n
dup { % ... (fmt) n
exch (%) search pop % ... n (post)(%)(pre)
print pop % ... n (post)
exch dup % ... (post) n n
2 add -1 roll % .. (post) n val1
3 1 roll 1 sub % .. val1 (post) n=n-1
exch % .. val1 n (post)
dup 0 1 getinterval % .. val1 n (post) (p)
exch 1 1 index % .. val1 n (p) (post) 1 (post)
length 1 sub getinterval % .. val1 n (p) (ost)
exch 4 -1 roll % .. n (ost) (p) val1
exch //formats exch
2 copy known not { pop /default } if get exec
print % .. n (ost)
exch
} repeat
pop
print
} def
1 1 add (1+1=%d) printf
But, if I may criticize a little, this probably isn't the best use of postscript. For one, the conversion specifiers aren't really necessary since postscript objects carry their own type info. There was a NeWS extension operator called printf which comes closer to the mark, I think (ref) (pdf).
I know it says sprintf here which is a little different, but the printf entry just referred to this entry.

Resources