Perl sorting Alpha characters in a special way

I know this question may have been asked a million times but I am stumped. I have an array that I am trying to sort. The results I want to get are
A
B
Z
AA
BB
The sort routines that are available don't sort it this way, and I am not sure whether it can be done. Here is my Perl script and the sorting that I am doing. What am I missing?
# header
use warnings;
use strict;
use Sort::Versions;
use Sort::Naturally 'nsort';
print "Perl Starting ... \n\n";
my @testArray = ("Z", "A", "AA", "B", "AB");
#sort1
my @sortedArray1 = sort @testArray;
print "\nMethod1\n";
print join("\n",@sortedArray1),"\n";
my @sortedArray2 = nsort @testArray;
print "\nMethod2\n";
print join("\n",@sortedArray2),"\n";
my @sortedArray3 = sort { versioncmp($a,$b) } @testArray;
print "\nMethod3\n";
print join("\n",@sortedArray3),"\n";
print "\nPerl End ... \n\n";
1;
OUTPUT:
Perl Starting ...
Method1
A
AA
AB
B
Z
Method2
A
AA
AB
B
Z
Method3
A
AA
AB
B
Z
Perl End ...

I think what you want is to sort by length and then by ordinal. This is easily managed with:
my @sortedArray = sort {
length $a <=> length $b ||
$a cmp $b
} @testArray;
This reads just like the English: sort by the length of a versus b, then by a compared to b.
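For comparison, the same two-level ordering can be sketched in Ruby. This is only an illustration (the question itself is about Perl), and `length_then_alpha` is a hypothetical helper name, not part of any library.

```ruby
# Sort strings by length first, then alphabetically within each length.
# `length_then_alpha` is a hypothetical name used for illustration.
def length_then_alpha(strings)
  # Arrays compare element by element, so [length, string] gives
  # "shorter first, then plain string order" in one key.
  strings.sort_by { |s| [s.length, s] }
end

p length_then_alpha(%w[Z A AA B AB])
# prints ["A", "B", "Z", "AA", "AB"]
```

Building a composite key is roughly the same idea as chaining `<=>` and `cmp` with `||` in the Perl comparator.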

my @sorted =
sort {
length($a) <=> length($b)
||
$a cmp $b
}
@unsorted;
or
# Strings must have no characters above 255, and
# they must be shorter than 2^32 characters.
my @sorted =
map substr($_, 4),
sort
map pack("N/a*", $_),
@unsorted;
or
use Sort::Key::Maker sort_by_length => sub { length($_), $_ }, qw( int str );
my @sorted = sort_by_length @unsorted;
The second is the most complicated, but it should be the fastest. The last one should be faster than the first.


How do you Compare Key Value Pairs Ruby

What is the easiest way to compare each key-value pair in a hash in Ruby, to one another?
For example,
I want to sort this hash so that the three highest values come first. If several keys are tied for the third spot, I want the greatest key to go in that spot.
{"Aa"=>1, "DDD"=>1, "DdD"=>1, "aA"=>1, "aa"=>1, "bb"=>1, "cC"=>1, "cc"=>1, "ddd"=>3, "e"=>7}
I need the above hash to be {"e"=>7, "ddd"=>3, "aa"=>1}
A single line, the last one below, does what you want. Add .reverse after .sort to change the sort direction.
Here is a solution:
# two lines for test
m = %w(a a a a a DDD DD ddd Ddd ddd e e e cC cC cC cC b x XXX XXX XXX ZZZ ZZZ ZZZ)
m = %w(n n n KKK KKK KKK KKK KKK LLL LLL LLL kk kk kk kk kk kk kk kk)
m = m.inject(Hash.new(0)) {|h, n| h.update(n => h[n]+1)}
m = m.sort.sort_by {|k, val| -val}.to_h
I would do it in three steps:
(1) Convert your Hash h to an Array of pairs:
kv_array = h.to_a
(2) Sort the array according to your criterion:
kv_array.sort! do
|left, right|
#.... implement your comparison here
end
(3) Turn your sorted array into a Hash
h_sorted = Hash[kv_array]
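A minimal sketch of those three steps follows. The comparison rule is an assumption (higher value first, and on equal values the smaller key first), and `top_pairs` is a hypothetical name. With plain string comparison, uppercase letters sort before lowercase ones, so the third pair here comes out as "Aa" rather than "aa".

```ruby
h = {"Aa"=>1, "DDD"=>1, "DdD"=>1, "aA"=>1, "aa"=>1, "bb"=>1, "cC"=>1, "cc"=>1, "ddd"=>3, "e"=>7}

# Assumed criterion: higher value first; on equal values, smaller key first.
# `top_pairs` is a hypothetical helper name.
def top_pairs(hash, n)
  kv_array = hash.to_a                       # (1) hash to array of pairs
  kv_array.sort! do |left, right|            # (2) sort by the criterion
    by_value = right[1] <=> left[1]
    by_value.zero? ? left[0] <=> right[0] : by_value
  end
  Hash[kv_array.first(n)]                    # (3) back to a hash
end

p top_pairs(h, 3)
```

Swapping `left[0] <=> right[0]` for a case-insensitive comparison such as `left[0].casecmp(right[0])` would be one way to change the tie-break, if that is closer to what you need.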
As suggested by @tadman, using a group_by and then sorting the relevant groups will get you what you want, although you will need to tweak it to fit your actual need, as a lot of assumptions were made:
m = {"Aa"=>1, "DDD"=>1, "DdD"=>1, "aA"=>1, "aa"=>1, "bb"=>1, "cC"=>1, "cc"=>1, "ddd"=>3, "e"=>7}
m.group_by { |k,v| v }
.each_with_object([]) {|a,x| x << [ a[1].compact.map { |b| b[0] }.min, a[0] ] }
.sort_by { |a| -a[1] }
.to_h
=> {"e"=>7, "ddd"=>3, "Aa"=>1}
explanation:
firstly we group by the value (which returns a hash with the value as the first key, and an array of the hashes that match)
then we collect the grouped hash, and find the "first" key for each grouped value (using min) ... * this is an assumption * ... returning the whole thing as an array
then we sort the new array based on the "value"
then we convert the array to a hash
I added in new-lines to aid readability (hopefully)
I am so excited to get so many great suggestions! Thank you! This entire problem was to take a string and output the top three occurring words in an array. But, no special characters were allowed unless they were an apostrophe that was within the word. If there were multiple values and they were in the top three you had to pick the one that came closest to "a". In the beginning, I used the tally method to add everything up really quickly, and my plan was then to sort the hash by the value, but then when the values were the same I couldn't put the right key-value pair where it had to be if they shared the same value.
So, I came here and asked about sorting a hash, and then realized I needed to scratch my entire approach altogether. In the end, I found that I could sort the hash to an extent, but not to the place I needed/wanted so this is what I came up with!
def top_3_words(str)
str.scan(/[\w]+'?[\w]*/).sort.slice_when{|a, b| a != b}.max(3){|a, b| a.size <=> b.size}.flatten.uniq
end
p top_3_words("a a a b c c d d d d e e e e e") == ["e", "d", "a"]
p top_3_words("e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e") == ["e", "ddd", "aa"]
p top_3_words(" //wont won't won't ") == ["won't", "wont"]
p top_3_words(" , e .. ") == ["e"]
p top_3_words(" ... ") == []
p top_3_words(" ' ") == []
p top_3_words(" ''' ") == []
p top_3_words("""In a village of La Mancha, the name of which I have no desire to call to
mind, there lived not long since one of those gentlemen that keep a lance
in the lance-rack, an old buckler, a lean hack, and a greyhound for
coursing. An olla of rather more beef than mutton, a salad on most
nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra
on Sundays, made away with three-quarters of his income.""") == ["a", "of", "on"]
My thought process with this was to call scan on the str so I could get everything I didn't want in the strings out of them. Then, I called sort on that return value because the last test case was really big and I needed all the like words together. After that, I called slice_when on that return value and said when a doesn't equal b then slice, so there would be a multi d array that I then could call max on, and because I sorted earlier, the values would be alphabetical so if there was a shared value it would give me the right one. I passed 3 to max to get the top three, and then called flatten so I had one array, and uniq to take out the extra characters!!
This is all so different from my original question, but I thought I would share what I was working on in case it could ever help anyone in the future!!
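For anyone who prefers the counting approach mentioned above, here is a rough sketch built on tally. It assumes that words are compared case-insensitively (everything is lowercased first) and that a token must contain at least one letter; `top_3_by_tally` is a hypothetical name, and Enumerable#tally needs Ruby 2.7 or later.

```ruby
# Count words case-insensitively, then rank by count (descending) and
# alphabetically on ties. `top_3_by_tally` is a hypothetical name.
# Assumes Ruby 2.7+ for Enumerable#tally.
def top_3_by_tally(str)
  # Tokens are runs of letters and apostrophes; drop tokens with no letter.
  words = str.downcase.scan(/[a-z']+/).select { |w| w.match?(/[a-z]/) }
  words.tally
       .sort_by { |word, count| [-count, word] }
       .first(3)
       .map(&:first)
end

p top_3_by_tally("a a a b c c d d d d e e e e e")
# prints ["e", "d", "a"]
```

Sorting on the pair `[-count, word]` handles the tie-break directly, so equal counts fall back to alphabetical order without a separate pass.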

How to delete all the lines that match specific condition

I have a number of PDB files and I want to keep only the lines that start with FORMUL. If a line has C followed by a number larger than 2 (C3, C4, C5, C6 ... C100, etc.), then I should not print it. The second condition is that if the sum of C, H and N within a line is >= 6, the line should be deleted.
So overall, delete the lines in which C is followed by a number greater than 2 and the sum of C+O+N is >= 6.
FORMUL 3 HOH *207(H2 O) (print it)
FORMUL 2 SF4 FE4 S4 (print it)
FORMUL 3 NIC C5 H7 N O7 (don't print, there is C5 and sum is more then 6)
FORMUL 4 HOH *321(H2 O) (print it)
FORMUL 3 HEM 2(C34 H32 FE N4 O4) (don't print, there is C34)
I have tried to do it in Perl, but the lines are so different from each other that I'm not sure it is possible.
Overall, these conditions should be applied together, meaning that all lines in which C>2 and sum>=6 should be deleted.
C1 O5 N3 should be deleted; C3 N1 O1 should not be deleted although C is 3.
In Perl I don't know how to express these two conditions. Here I wrote the opposite situation: not deleting, but printing the lines for which these two conditions are not met.
#!/usr/bin/perl
use strict;
use warnings;
my @lines;
my $file;
my $line;
open ($file, '<', '5PCZ.pdb') or die $!;
while (my $line = <$file>)
{
if ($line =~ m/^FORMUL/)
{
push (@lines, $line);
}
}
close $file;
#print "@lines\n";
foreach $line(@lines)
{
if ($line eq /"C"(?=([0-2]))/ )
{
elsif ($line eq "Sum of O,N & C is lt 6")
print @lines
}
}
As you've seen, it's probably easier to write this as a filter that prints the lines that you want to keep. I've also written this following the Unix Filter Model (reads from STDIN, writes to STDOUT) as that makes the program far more flexible (and, interestingly, easier to write!)
Assuming that you're running the program on Linux (or similar) and that your code is in an executable file called my_filter (I recommend a more descriptive name!) then you would call it like this:
$ my_filter < 5PCZ.pdb > 5PCZ.pdb.new
The code would look like this:
#!/usr/bin/perl
use strict;
use warnings;
while (<>) { # read from STDIN a line at a time
# Split data on whitespace, but only into four columns
my @cols = split /\s+/, $_, 4;
next unless $cols[0] eq 'FORMUL';
# Now extract the letter stuff into a hash for easy access.
# We extract letters from the final column in the record.
my %letters = $cols[-1] =~ m/([A-Z])(\d+)/g;
# Give the values we're interested in, a default of 0
$letters{$_} //= 0 for (qw[C O N]);
next if $letters{C} > 2
and $letters{C} + $letters{O} + $letters{N} >= 6;
# I think we can then print the line;
print;
}
This seems to give the correct output for your sample data. And I hope the comments make it obvious how to tweak the conditions.
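The same keep-or-drop rule can be sketched in Ruby as a small predicate, which makes it easy to try against the sample lines. This mirrors the conditions in the Perl filter (drop when C > 2 and C + O + N >= 6); `keep_formul_line?` is a hypothetical name.

```ruby
# Returns true when a line should be printed, mirroring the Perl filter:
# keep FORMUL lines unless C > 2 and C + O + N >= 6.
# `keep_formul_line?` is a hypothetical helper name.
def keep_formul_line?(line)
  cols = line.split(/\s+/, 4)            # at most four columns
  return false unless cols[0] == "FORMUL"

  counts = Hash.new(0)                   # missing elements count as 0
  cols[3].to_s.scan(/([A-Z])(\d+)/) { |letter, number| counts[letter] = number.to_i }
  !(counts["C"] > 2 && counts["C"] + counts["O"] + counts["N"] >= 6)
end

sample = [
  "FORMUL 3 HOH *207(H2 O)",
  "FORMUL 2 SF4 FE4 S4",
  "FORMUL 3 NIC C5 H7 N O7",
  "FORMUL 4 HOH *321(H2 O)",
  "FORMUL 3 HEM 2(C34 H32 FE N4 O4)"
]
puts sample.select { |line| keep_formul_line?(line) }
```

As in the Perl version, an element written without a number (such as the bare N in the NIC line) is not matched by the pattern and so contributes 0 to the sum.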
Extended Awk solution:
awk -F'[[:space:]][[:space:]]+' \
'/^FORMUL/{
if ($4 !~ /\<C/) print;
else {
match($4, /\<C[0-9]+/);
c=substr($4, RSTART+1, RLENGTH);
if (c > 2) next;
else {
match($4, /\<O[0-9]+/);
o=substr($4, RSTART+1, RLENGTH);
match($4, /\<N[0-9]+/);
n=substr($4, RSTART+1, RLENGTH);
if (c+o+n < 6) print
}
}
}' 5PCZ.pdb
The output:
FORMUL 3 HOH *207(H2 O)
FORMUL 2 SF4 FE4 S4
FORMUL 4 HOH *321(H2 O)

xyz coordinates manipulation using sed or awk

I have a huge number of plain text files containing Cartesian xyz coordinates of chemical structures. A sample could look like that:
B -1.38372433 0.56274955 2.22204795
B 0.01637488 1.69210489 1.81167819
B 0.29103422 -0.35499374 0.15388510
B 1.14485163 0.19631678 1.74992009
Fe -0.92583118 1.01775624 0.27450973
S -0.35374797 -1.05624221 1.74656393
C -1.87367299 1.66919492 -1.27526252
O -2.42173866 2.04584255 -2.17123145
H -2.54747585 0.75818308 2.22742141
H 0.62677160 -0.81072498 -0.88156036
H 0.38495881 2.74424131 2.19841880
H 2.25808628 0.09159351 1.37282254
In this case, each H atom is bonded to a B atom with a distance of 1.18 angstroms. What I'm supposed to do is to change, in turn, each BH vertex by a P vertex.
Using bash, I'd like to act on all text files at once by taking the coordinates of the first B atom encountered and use it as a point of origin of a sphere and search within a radius of 1.18 Angstroms for the bonded Hydrogen atom, delete this H atom with its coordinates then change the B into a P atom.
An expected output of the above sample would be something like that:
P -1.38372433 0.56274955 2.22204795
B 0.01637488 1.69210489 1.81167819
B 0.29103422 -0.35499374 0.15388510
B 1.14485163 0.19631678 1.74992009
Fe -0.92583118 1.01775624 0.27450973
S -0.35374797 -1.05624221 1.74656393
C -1.87367299 1.66919492 -1.27526252
O -2.42173866 2.04584255 -2.17123145
H 0.62677160 -0.81072498 -0.88156036
H 0.38495881 2.74424131 2.19841880
H 2.25808628 0.09159351 1.37282254
I've done something similar a while back, but that was adding xyz coordinates of a H atom at a distance of 1.2 Angstroms from an existing B atom. what I used back then was:
for i in *.inp; do awk '/^B / { print; if (++count == 1) printf("%-10.8f %-14.8f %-14.8f %s\n", "H", $2+1.2, $3+1.2, $4+1.2); next } { print }' $i > temp/`basename $i`--H.inp; done
However, I'm still not successful in coming up with something similar to solve my current problem.
Any help is really appreciated
Thanks in advance
Perl solution:
#!/usr/bin/perl
use warnings;
use strict;
my @P;
my $deleted;
while (<>) {
my @F = split;
$F[0] = 'P', @P = @F if ! @P && 'B' eq $F[0];
if ('H' eq $F[0] && ! $deleted) {
die "No B found yet!\n" unless @P;
my $close = grep abs($F[$_] - $P[$_]) <= 1.18001, 1, 2, 3;
$deleted = 1, next if 3 == $close;
}
print "@F\n";
}
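Note that the Perl script above compares each coordinate separately, which tests a cube around the B atom rather than a sphere. A rough Ruby sketch using the Euclidean distance is given below. The tolerance `tol` is an assumption: in the sample, the first B and its H are slightly more than 1.18 apart once rounding is taken into account, so a strict cutoff of 1.18 would miss it. `replace_first_bh` is a hypothetical name.

```ruby
# Replace the first B atom with P and drop the first H atom lying within
# bond + tol angstroms of it (Euclidean distance). Returns a new array.
# `replace_first_bh`, `bond` and `tol` are hypothetical names; the
# tolerance value is an assumption.
def replace_first_bh(lines, bond: 1.18, tol: 0.01)
  atoms = lines.map do |line|
    symbol, *xyz = line.split
    [symbol, xyz.map(&:to_f)]
  end

  b_idx = atoms.index { |symbol, _| symbol == "B" }
  return lines.dup if b_idx.nil?

  b_xyz = atoms[b_idx][1]
  h_idx = atoms.index { |symbol, xyz|
    symbol == "H" &&
      Math.sqrt(xyz.zip(b_xyz).sum { |p, q| (p - q)**2 }) <= bond + tol
  }
  return lines.dup if h_idx.nil?   # no bonded H found: leave input as is

  result = []
  lines.each_with_index do |line, i|
    next if i == h_idx                                   # delete the H
    result << (i == b_idx ? line.sub(/\AB\b/, "P") : line) # B becomes P
  end
  result
end

sample = [
  "B -1.38372433 0.56274955 2.22204795",
  "B 0.01637488 1.69210489 1.81167819",
  "H -2.54747585 0.75818308 2.22742141",
  "H 0.38495881 2.74424131 2.19841880"
]
puts replace_first_bh(sample)
```

To process many files, the function could be called on `File.readlines(path, chomp: true)` for each input file and the result written to a new file.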

How to optimize the ruby code iterates several integers in a range?

New to Ruby, and trying to find a 3-digit number "abc":
for a in 0..9
for b in 0..9
for c in 0..9
if a*a+b*b+c*c == (100*a+10*b+c)/11.0
puts "#{a}#{b}#{c}"
end
end
end
end
This is too lengthy. Is there any way to optimize it, or to write it in a more idiomatic Ruby way?
Solution from: Wolfram Alpha :)
Here's another fun solution. Not really faster, just more compact and perhaps more ruby-like if that was what you were looking for:
(0..9).to_a.repeated_permutation(3).select { |a,b,c|
a*a+b*b+c*c == (100*a+10*b+c)/11.0
}
=> [[0, 0, 0], [5, 5, 0], [8, 0, 3]]
This is equivalent to finding a,b,c such that
100*a + 10*b + c = 11 * (a*a + b*b +c*c)
i.e. 100*a + 10*b + c must be divisible by 11. Simple number theory tells you that when a,b,c are digits, this means that
`a + c - b`
must be a multiple of 11 so
`a + c = b or a + c = 11 +b`
So for given values of a and b you only need to check two values of c, namely b - a and 11 + b - a, rather than all 10. You can cut the search space in two again: if a > b you only need to check the latter of those two values, and if a <= b you only need to check the former.
Thus instead of checking 1000 triplets of numbers you only need to check 100, which should be about 10 times faster.
for a in 0..9
for b in 0..9
if a > b
c = 11 +b -a
else
c = b - a
end
if a*a+b*b+c*c == (100*a+10*b+c)/11.0
puts "#{a}#{b}#{c}"
end
end
end
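As a quick check, the reduced search can be compared against the brute-force one. The sketch below uses the integer form of the condition, 11 * (a*a + b*b + c*c) == 100*a + 10*b + c, to avoid floating-point division, and it also skips c = 10, which the formula produces when a = b + 1. The method names are hypothetical.

```ruby
# Integer form of the condition avoids floating-point division.
def matches?(a, b, c)
  11 * (a * a + b * b + c * c) == 100 * a + 10 * b + c
end

# All 1000 digit triples.
def brute_force
  (0..9).to_a.repeated_permutation(3).select { |a, b, c| matches?(a, b, c) }
end

# Only one candidate c per (a, b) pair, from the divisibility-by-11 rule.
def reduced_search
  digits = (0..9).to_a
  digits.product(digits).each_with_object([]) do |(a, b), found|
    c = a > b ? 11 + b - a : b - a
    next unless c.between?(0, 9)   # a = b + 1 would give c = 10, not a digit
    found << [a, b, c] if matches?(a, b, c)
  end
end

p reduced_search
# prints [[0, 0, 0], [5, 5, 0], [8, 0, 3]]
```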

Is there an effective algorithm that will return all different combinations?

EDITED: I meant COMBINATION and not PERMUTATIONS
Is there an effective algorithm that will return all different permutations from the given array?
["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", ...]
e.g.: AB,AC,AD,..,DE,..,HI,..,ABC,ABD,...,DEF,..,CDEFG,...,ABCDEFGHIJK,....
I found some algorithms, but they return ALL permutation and not different ones. By different I mean that:
AB & BA are the same permutations
DEF & FED & EFD & DFE are the same permutations,
The best I can think is sort of a binary counter:
A B C
-------
0 0 0 | Empty Set
0 0 1 | C
0 1 0 | B
0 1 1 | BC
1 0 0 | A
1 0 1 | AC
1 1 0 | AB
1 1 1 | ABC
Any given item is either in the combination or not. Think of a boolean flag for each item, saying whether it's in the combination. If you go through every possible list of values of the boolean flags, you've gone through every combination.
Lists of boolean values are also called binary integers.
If you have items A-K, you have 11 items. So, go through all the possible 11-bit numbers. In Java:
for (int flags = 0; flags < (1 << 11); ++flags) {
int x = indexOfSomeItemFromZeroToTen();
boolean isInCombination = ((flags >> x) & 1) == 1;
}
Start at 1 instead of 0 if you want to skip the empty combination.
As pointed out by the comments, you are looking to enumerate all subsets, not permutations.
The easiest way to do this is to use a binary counter. For example, suppose you have n-elements, then something like this would work in C:
for(int i=0; i<(1<<n); ++i) {
//Bits of i represent subset, eg. element k is contained in subset if i&(1<<k) is set.
}
I hope this helps answer your question
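The binary-counter idea can also be sketched in Ruby, where Integer#[] reads a single bit. `subsets_by_bitmask` is a hypothetical name, and the sketch collects every subset in an array, which is only practical for small inputs since there are 2^n of them.

```ruby
# Enumerate every subset by counting from 0 to 2**n - 1 and treating
# bit k of the counter as "element k is included".
# `subsets_by_bitmask` is a hypothetical helper name.
def subsets_by_bitmask(items)
  (0...(1 << items.length)).map do |mask|
    items.each_index.select { |k| mask[k] == 1 }.map { |k| items[k] }
  end
end

subsets_by_bitmask(%w[A B C]).each { |s| puts s.join }
```

Starting the range at 1 instead of 0 skips the empty subset.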
Here is some Ruby code I've written and had for a while now that iterates through each combination in an array.
def _pbNextComb(comb,length) # :nodoc:
i=comb.length-1
begin
valid=true
for j in i...comb.length
if j==i
comb[j]+=1
else
comb[j]=comb[i]+(j-i)
end
if comb[j]>=length
valid=false
break
end
end
return true if valid
i-=1
end while i>=0
return false
end
#
# Iterates through the array and yields each
# combination of _num_ elements in the array
#
# Takes an array and the number of elements
# in each combination
def pbEachCombination(array,num)
return if array.length<num || num<=0
if array.length==num
yield array
return
elsif num==1
for x in array
yield [x]
end
return
end
currentComb=[]
arr=[]
for i in 0...num
currentComb[i]=i
end
begin
for i in 0...num
arr[i]=array[currentComb[i]]
end
yield arr
end while _pbNextComb(currentComb,array.length)
end
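For what it's worth, Ruby's built-in Array#combination does the same job for a fixed size, so listing the combinations of every size can be sketched as follows. `all_combinations` is a hypothetical name.

```ruby
# Collect the combinations of every size from 1 up to the array length,
# using the built-in Array#combination.
# `all_combinations` is a hypothetical helper name.
def all_combinations(items)
  (1..items.length).flat_map { |n| items.combination(n).to_a }
end

p all_combinations(%w[A B C])
# prints [["A"], ["B"], ["C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C"]]
```

Calling `items.combination(n)` with a block instead of `to_a` avoids building the whole list in memory, which matters once the array grows.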
These are not permutations. The permutations of ABC are {ABC, ACB, BCA, BAC, CAB, CBA}. You are interested in finding all distinct subsets (otherwise known as the power set) of {A,B,C, ..., K, ...}. This is easy to do: each element can be either included or excluded. Here's a recursive algorithm:
power_set(U) =
if U == {} then
{{}};
else
union( { union({first(U)}, S) for each S in power_set(tail(U)) }, // include
power_set(tail(U)) // exclude
);
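A direct Ruby rendering of that recursion might look like the sketch below. `power_set` here is a hypothetical helper, not a built-in, and subsets are represented as arrays.

```ruby
# Recursive power set: every subset of the tail, once with the first
# element added (include) and once without it (exclude).
# `power_set` is a hypothetical helper name.
def power_set(items)
  return [[]] if items.empty?

  first, *rest = items
  without_first = power_set(rest)
  without_first.map { |subset| [first] + subset } + without_first
end

power_set(%w[A B C]).each { |s| puts s.join }
```

For three elements this yields 8 subsets, including the empty one.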
