xyz coordinates manipulation using sed or awk - bash

I have a huge number of plain text files containing Cartesian xyz coordinates of chemical structures. A sample could look like that:
B -1.38372433 0.56274955 2.22204795
B 0.01637488 1.69210489 1.81167819
B 0.29103422 -0.35499374 0.15388510
B 1.14485163 0.19631678 1.74992009
Fe -0.92583118 1.01775624 0.27450973
S -0.35374797 -1.05624221 1.74656393
C -1.87367299 1.66919492 -1.27526252
O -2.42173866 2.04584255 -2.17123145
H -2.54747585 0.75818308 2.22742141
H 0.62677160 -0.81072498 -0.88156036
H 0.38495881 2.74424131 2.19841880
H 2.25808628 0.09159351 1.37282254
In this case, each H atom is bonded to a B atom with a distance of 1.18 angstroms. What I'm supposed to do is to change, in turn, each BH vertex by a P vertex.
Using bash, I'd like to act on all text files at once by taking the coordinates of the first B atom encountered and use it as a point of origin of a sphere and search within a radius of 1.18 Angstroms for the bonded Hydrogen atom, delete this H atom with its coordinates then change the B into a P atom.
An expected output of the above sample would be something like that:
P -1.38372433 0.56274955 2.22204795
B 0.01637488 1.69210489 1.81167819
B 0.29103422 -0.35499374 0.15388510
B 1.14485163 0.19631678 1.74992009
Fe -0.92583118 1.01775624 0.27450973
S -0.35374797 -1.05624221 1.74656393
C -1.87367299 1.66919492 -1.27526252
O -2.42173866 2.04584255 -2.17123145
H 0.62677160 -0.81072498 -0.88156036
H 0.38495881 2.74424131 2.19841880
H 2.25808628 0.09159351 1.37282254
I've done something similar a while back, but that was adding xyz coordinates of a H atom at a distance of 1.2 Angstroms from an existing B atom. what I used back then was:
for i in *.inp; do awk '/^B / { print; if (++count == 1) printf("%-10.8f %-14.8f %-14.8f %s\n", "H", $2+1.2, $3+1.2, $4+1.2); next } { print }' $i > temp/`basename $i`--H.inp; done
However, I'm still not successful in coming up with something similar to solve my current problem.
Any help is really appreciated
Thanks in advance

Perl solution:
#!/usr/bin/perl
use warnings;
use strict;
my #P;
my $deleted;
while (<>) {
my #F = split;
$F[0] = 'P', #P = #F if ! #P && 'B' eq $F[0];
if ('H' eq $F[0] && ! $deleted) {
die "No B found yet!\n" unless #P;
my $close = grep abs($F[$_] - $P[$_]) <= 1.18001, 1, 2, 3;
$deleted = 1, next if 3 == $close;
}
print "#F\n";
}

Related

Perl sorting Alpha characters in a special way

I know this question may have been asked a million times but I am stumped. I have an array that I am trying to sort. The results I want to get are
A
B
Z
AA
BB
The sort routines that are available dont sort it this way. I am not sure if it can be done. Here's is my perl script and the sorting that I am doing. What am I missing?
# header
use warnings;
use strict;
use Sort::Versions;
use Sort::Naturally 'nsort';
print "Perl Starting ... \n\n";
my #testArray = ("Z", "A", "AA", "B", "AB");
#sort1
my #sortedArray1 = sort #testArray;
print "\nMethod1\n";
print join("\n",#sortedArray1),"\n";
my #sortedArray2 = nsort #testArray;
print "\nMethod2\n";
print join("\n",#sortedArray2),"\n";
my #sortedArray3 = sort { versioncmp($a,$b) } #testArray;
print "\nMethod3\n";
print join("\n",#sortedArray3),"\n";
print "\nPerl End ... \n\n";
1;
OUTPUT:
Perl Starting ...
Method1
A
AA
AB
B
Z
Method2
A
AA
AB
B
Z
Method3
A
AA
AB
B
Z
Perl End ...
I think what you want is to sort by length and then by ordinal. This is easily managed with:
my #sortedArray = sort {
length $a <=> length $b ||
$a cmp $b
} #testArray;
That is exactly as the English: sort based on length of a vs b, then by a compared to b.
my #sorted =
sort {
length($a) <=> length($b)
||
$a cmp $b
}
#unsorted;
or
# Strings must be have no characters above 255, and
# they must be shorter than 2^32 characters long.
my #sorted =
map substr($_, 4),
sort
map pack("N/a*", $_),
#unsorted;
or
use Sort::Key::Maker sort_by_length => sub { length($_), $_ }, qw( int str );
my #sorted = sort_by_length #unsorted;
The second is the most complicated, but it should be the fastest. The last one should be faster than the first.

How to delete all the lines that match specific condition

I have a number of pdb files and I want to keep only those lines that starts with ^FORMUL and if line has C followed by the number that is larger then (C3,C4,C5,C6..100 etc) then I should not print it. Second condition is that within every line sum of C, H and N should be >= 6, then delete it
So overall delete the lines in which C is followed by number more then 2> and sum of C+O+N is >= then 6.
FORMUL 3 HOH *207(H2 O) (print it)
FORMUL 2 SF4 FE4 S4 (print it)
FORMUL 3 NIC C5 H7 N O7 (don't print, there is C5 and sum is more then 6)
FORMUL 4 HOH *321(H2 O) (print it)
FORMUL 3 HEM 2(C34 H32 FE N4 O4) (don't print, there is C34)
I have tried to do it in perl but lines are soo diverse from each other so Im not sure if it is possible to do.
Over all these conditions chould be included together, meaning that all lines in which C>2 and sum>=6 should be deleted.
C1 O5 N3 should be deleted; C3 N1 01 should not be deleted although C is 3.
In perl I don't know how to assign these two conditions. Here I wrote opposite situation not to delete but to print these lines if these two conditions are not met.
#!/usr/bin/perl
use strict;
use warnings;
my #lines;
my $file;
my $line;
open ($file, '<', '5PCZ.pdb') or die $!;
while (my $line = <$file>)
{
if ($line =~ m/^FORMUL/)
{
push (#lines, $line);
}
}
close $file;
#print "#lines\n";
foreach $line(#lines)
{
if ($line eq /"C"(?=([0-2]))/ )
{
elsif ($line eq "Sum of O,N & C is lt 6")
print #lines
}
}
As you've seen, it's probably easier to write this as a filter that prints the lines that you want to keep. I've also written this following the Unix Filter Model (reads from STDIN, writes to STDOUT) as that makes the program far more flexible (and, interestingly, easier to write!)
Assuming that you're running the program on Linux (or similar) and that your code is in an executable file called my_filter (I recommend a more descriptive name!) then you would call it like this:
$ my_filter < 5PCZ.pdb > 5PCZ.pdb.new
The code would look like this:
#!/usr/bin/perl
use strict;
use warnings;
while (<>) { # read from STDIN a line at a time
# Split data on whitespace, but only into four columns
my #cols = split /\s+/, $_, 4;
next unless $cols[0] eq 'FORMUL';
# Now extract the letter stuff into a hash for easy access.
# We extract letters from the final column in the record.
my %letters = $cols[-1] =~ m/([A-Z])(\d+)/g;
# Give the values we're interested in, a default of 0
$letters{$_} //= 0 for (qw[C O N]);
next if $letters{C} > 2
and $letters{C} + $letters{O} + $letters{N} >= 6;
# I think we can then print the line;
print;
}
This seems to give the correct output for your sample data. And I hope the comments make it obvious how to tweak the conditions.
Extended Awk solution:
awk -F'[[:space:]][[:space:]]+' \
'/^FORMUL/{
if ($4 !~ /\<C/) print;
else {
match($4, /\<C[0-9]+/);
c=substr($4, RSTART+1, RLENGTH);
if (c > 2) next;
else {
match($4, /\<O[0-9]+/);
o=substr($4, RSTART+1, RLENGTH);
match($4, /\<N[0-9]+/);
n=substr($4, RSTART+1, RLENGTH);
if (c+o+n < 6) print
}
}
}' 5PCZ.pdb
The output:
FORMUL 3 HOH *207(H2 O)
FORMUL 2 SF4 FE4 S4
FORMUL 4 HOH *321(H2 O)

Identify start and end of arc out of three points/angles

I have three points that I know to be on a circle that represent the start, middle and end of an arc (a, m and b). I also have the angle of these points, from the positive X axis in an anti-clockwise direction, using atan2(y,x) of the three respective vectors from the centre to the points, so we have -pi < theta < pi.
I also know which of the points is m, and what I want to know is which of a and b is the clockwise end of the arc.
I can see that there are 8 ways the points can be arranged:
"East" "West" "East"
0 -pi | pi 0
---------------+-------------
a m b |
a m | b
a | m b
| a m b
b m a |
b m | a
b | m a
| b m a
where the first four have a as the "end" and b as the "start" and the latter four are the other way around. Bear in mind that orders about can wrap around at 0 and appear on the right or left, so sign is not helpful.
Is there a tidy way to work out which is the start and which is the end point? Other than laboriously checking relative values among each of the 8 options in a big, dense if/else-if block, that is.
Implementation language is Python, but this is not a language specific question!
If m is on the left side of the directed line segment from a to b, then a is the counterclockwise end; otherwise, it's the clockwise end.
That is, take the left perpendicular of the vector ab, and find its dot product with am. If the dot product is positive, a is the CCW endpoint.
Incidentally, the tidiest way to deal with angles is to avoid using them. Vectors and linear algebra out-tidy angles and trigonometry any day of the week.
I have just run into the same problem - thanks for the great writeup of the question. I think solving it the way you were heading is actually not so "laborious".
You can use your table to see it's just a question of which order the three angles are arranged, cyclically. A < M < B or A > M > B.
// = a < m < b ||
// b < a < m ||
// m < b < a;
anticlockwise = a < m && m < b || b < a && a < m || m < b && b < a;
Depending on how your language of choice feels about coercing booleans into integers, you might be able to write that as:
anticlockwise = (a < m + m < b + b < a) === 2;
(Well I found this a lot easier than trying to understand and compute perpendiculars, dot products, linear algebra...)

How to optimize the ruby code iterates several integers in a range?

New to Ruby, and trying to find a 3 digits number "abc":
for a in 0..9
for b in 0..9
for c in 0..9
if a*a+b*b+c*c == (100*a+10*b+c)/11.0
puts "#{a}#{b}#{c}"
end
end
end
end
This is too lengthy, is any way to optimize it, or write it in another "ruby" way?
Solution from: Wolfram Alpha :)
Here's another fun solution. Not really faster, just more compact and perhaps more ruby-like if that was what you were looking for:
(0..9).to_a.repeated_permutation(3).select { |a,b,c|
a*a+b*b+c*c == (100*a+10*b+c)/11.0
}
=> [[0, 0, 0], [5, 5, 0], [8, 0, 3]]
This is equivalent to finding a,b,c such that
100*a + 10*b + c = 11 * (a*a + b*b +c*c)
i.e. 100*a + 10*b + c must be divisible by 11. Simple number theory tells you that when a,b,c are digits, this means that
`a + c - b`
must be a multiple of 11 so
`a + c = b or a + c = 11 +b`
So for a given values of a and b you only need to check two values of c : b -a and 11 +b -a rather than 10. You can cut the search space in two again: if a > b you only need to check the latter of those two values and if a <= b you need only check the former.
Thus instead of checking 1000 triplets of numbers you should only need to check 100, which should be 10 times faster.
for a in 0..9
for b in 0..9
if a > b
c = 11 +b -a
else
c = b - a
end
if a*a+b*b+c*c == (100*a+10*b+c)/11.0
puts "#{a}#{b}#{c}"
end
end
end

Minimal-change algorithm which maximises 'swapping'

This is a question on combinatorics from a non-mathematician, so please try to bear with me!
Given an array of n distinct characters, I want to generate subsets of k characters in a minimal-change order, i.e. an order in which generation i+1 contains exactly one character that was not in generation i. That's not too hard in itself. However, I also want to maximise the number of cases in which the character that is swapped out in generation i+1 is the same character that was swapped in in generation i. To illustrate, for n=7, k=3:
abc abd abe* abf* abg* afg aeg* adg* acg* acd ace* acf* aef adf* ade
bde bdf bef bcf* bce bcd* bcg* bdg beg* bfg* cfg ceg* cdg* cde cdf* cef def deg dfg efg
The asterisked strings indicate the case I want to maximise; e.g. the e that is new in generation 3, abe, replaces a d that was new in generation 2, abd. It doesn't seem possible to have this happen in every generation, but I want it to happen as often as possible.
Typical array sizes that I use are 20-30 and subset sizes around 5-8.
I'm using an odd language, Icon (or actually its derivative Unicon), so I don't expect anyone to post code that I can used directly. But I will be grateful for answers or hints in pseudo-code, and will do my best to translate C etc. Also, I have noticed that problems of this kind are often discussed in terms of arrays of integers, and I can certainly apply solutions posted in such terms to my own problem.
Thanks
Kim Bastin
Edit 15 June 2010:
I do seem to have got into deeper water than I thought, and while I'm grateful for all answers, not all of them have been relevant. As an example of a solution which is NOT adequate, let me post my own Unicon procedure for generating k-ary subsets of a character set s in a minimal change order. Things you need to know to understand the code are: a preposed * means the size of a structure, so if s is a string, *s means the size of s (the number of characters it contains). || is a string concatenation operation. A preposed ! produces each element of a structure, e.g. each character of a string, in turn on successive passes. And the 'suspend' control structure returns a result from a procedure, but leaves the procedure 'in suspense', with all local variables in place, so that new results can be produced if the procedure is called in a loop.
procedure revdoor(s, k)
# Produces all k-subsets of a string or character set s in a 'revolving
# door' order. Each column except the first traverses the characters
# available to it in alphabetical and reverse alphabetical order
# alternately. The order of the input string is preserved.
# If called in a loop as revdoor("abcdefg", 3),
# the order of production is: abc, abd, abe, abf, abg, acg, acf, ace, acd,
# ade, adf, adg, aeg, aef, afg, bfg, bef, beg, bdg, bdf, bde, bcd, bce,
# bcf, bcg, cdg, cdf, cde, cef, ceg, cfg, dfg, deg, def, efg
local i
static Ctl
if /Ctl then { # this means 'if Ctl doesn't exist'
if k = 0 then return ""
Ctl := list(k, 1) # a list of k elements, each initialised to 1.
}
if Ctl[k] = 1 then {
if k = 1 then suspend !s else
every i := 1 to *s-k+1 do {
suspend s[i] || revdoor(s[i+1:0], k-1)
}
} else {
if k = 1 then suspend !reverse(s) else
every i := -k to -*s by -1 do {
suspend s[i] || revdoor(s[i+1:0], k-1)
}
}
# the following line multiplies element k of Ctl by -1 if k < size of Ctl
# (this controls the order of generation of characters),
# and destroys Ctl on final exit from the procedure.
if k < *Ctl then Ctl[k] *:= -1 else Ctl := &null
end
Note that the output of the above procedure is not optimal in my sense. One result of my investigations so far is that the maximum 'swapping score' for a list of k-ary subsets of n elements is not less than comb(n-1, k), or in the case of n=7, k=3, the maximum score is at least comb(6, 3) = 20. I define the 'swapping score' of a list as the number of items in the list whose new element replaces an element in the previous item which was itself new. I haven't got the mathematical equipment to prove this, but it is easy to see with k=1 or k=2. For certain (n,k) a slightly higher score is possible, as in the case of n=7, k=3:
abc abd abe abf abg
acg adg aeg afg
efg dfg cfg bfg
beg bdg bcg
bcd bce bcf
bdf bef
def cef aef
adf acf
acd ace
ade
bde cde
cdf cdg
ceg
deg (swapping score = 21)
It may be noted that the above list is in 'strong minimal change order' (like word golf: the new character is always in the same position as the character it replaces), which may indicate the direction my own work is taking. I hope to post something more in a few days.
Kim
It's fairly straightforward. In order to maximise replacement just think of the characters as numbers and increment the string by one till you have reached the upper limit.
Then check to see that you don't use the same character twice in the string.
I think this would work:
char c[] = {'a', 'b', 'c', 'd', 'e'};
const int n = 5;
const int k = 3;
char s[k];
void print()
{
for( int i = 0; i < k; ++i )
putchar(c[s[i]]);
putchar('\n');
}
bool increment( int m )
{
// reached the limit?
if( ++s[m] == n && m == 0 )
return false;
next:
for( int i = 0; i < m; ++i )
{
if( s[m] == n )
{
// carry
s[m] = 0;
if( !increment( m-1 ))
return false;
goto next;
}
else if( s[i] == s[m] )
{
// the character is already used
++s[m];
goto next;
}
}
return true;
}
int main(int, char**)
{
// initialise
for( int i = 0; i < k; ++i )
s[i] = i;
// enumerate all combinations
do
print();
while(increment(k-1));
}
Kim, your problem description sounds very much like a (homework) attempt to describe the simplest solution for enumerating all k-combinations of a set of n elements - without giving the actual solution away too easily. Anyway, see below for my shot. I used Java but the important parts are not different from C.
public class Homework
{
/**
* Prints all k-combinations of a set of n elements. Answer to this
* question: http://stackoverflow.com/questions/2698551
*/
public static void main(String[] args)
{
Combinations combinations = new Combinations(7, 3);
System.out.printf(
"Printing all %d %d-combinations of a set with %d elements:\n",
combinations.size(), combinations.k, combinations.n);
for (int[] c : combinations)
System.out.println(Arrays.toString(c));
}
/**
* Provides an iterator for all k-combinations of a set of n elements.
*/
static class Combinations implements Iterable<int[]>
{
public final int n, k;
public Combinations(int n, int k)
{
if (n < 1 || n < k)
throw new IllegalArgumentException();
this.n = n;
this.k = k;
}
#Override
public Iterator<int[]> iterator()
{
return new Iterator<int[]>()
{
private int[] c;
#Override
public void remove() { throw new UnsupportedOperationException(); }
#Override
public int[] next()
{
if (c == null)
{
c = new int[k];
for (int i = 0; i < k; i++)
c[i] = i;
}
else
{
int i = c.length - 1;
while (i >= 0 && c[i] == n - k + i)
i--;
if (i < 0)
throw new NoSuchElementException();
c[i]++;
for (int j = i + 1; j < c.length; j++)
c[j] = c[i] + j - i;
}
return c.clone(); // remove defensive copy if performance is more important
}
#Override
public boolean hasNext() { return c == null || c[0] < n - k; }
};
}
/**
* Returns number of combinations: n! / (k! * (n - k)!).
*/
public BigInteger size()
{
BigInteger s = BigInteger.valueOf(n);
for (int i = n - 1; i > n - k; i--)
s = s.multiply(BigInteger.valueOf(i));
for (int i = k; i > 1; i--)
s = s.divide(BigInteger.valueOf(i));
return s;
}
}
}
Here is the output for your example:
Printing all 35 3-combinations of a set with 7 elements:
[0, 1, 2] [0, 1, 3] [0, 1, 4] [0, 1, 5] [0, 1, 6] [0, 2, 3] [0, 2, 4] [0, 2, 5] [0, 2, 6] [0, 3, 4]
[0, 3, 5] [0, 3, 6] [0, 4, 5] [0, 4, 6] [0, 5, 6] [1, 2, 3] [1, 2, 4] [1, 2, 5] [1, 2, 6] [1, 3, 4]
[1, 3, 5] [1, 3, 6] [1, 4, 5] [1, 4, 6] [1, 5, 6] [2, 3, 4] [2, 3, 5] [2, 3, 6] [2, 4, 5] [2, 4, 6]
[2, 5, 6] [3, 4, 5] [3, 4, 6] [3, 5, 6] [4, 5, 6]
Rather than start with an algorithm, I've tried to think of a way to find a form for the maximum "swapping score", so that you know what to shoot for. Often an algorithm for producing the desired structure can emerge from such a proof.
It's been a long time since university, but I've tried to think of a combinatorial model that will help to figure this out, without very much luck.
I started by imagining the set of combinations as vertices in a graph, with a edges corresponding to the "adjacency" (only one element difference) of the combinations. So:
"n choose k" vertices
each vertex has degree k(n-k)
number of edges = "n choose k" * k(n-k) / 2 = "n choose 2" * "n-2 choose k-1"
There's a lot of symmetry to these graphs. The graph is the same for any given {n,k} as it is for {n,n-k}. If k=1 or k=n-1 it's the complete graph on n vertices (each combinations differs from all the others by only one character). I can't see an obvious algorithm from this, though.
Edit: My next thought was to conceive the graph with a slightly different interpretation. You can think of each {n,k}-combination as a sequence of n bits where there are k 1s. The position of the 1s corresponds to which of the n characters is present in the combination. So for n=7 k=3, abc is 1110000, adg is 1001001, efg is 0000111. With this you can also imagine the points lying at the corners of an n-dimensional hypercube. So for a given subsequence, the edges match your "minimal swapping" criteria if they are co-planar: I think of them as "cutting planes" through the hypercube.
You are looking for a Hamiltonian path through this graph of combinations, one that meets your special criteria.
Another way to think of your problem is to minimize the number of times in the sequence that you do change which character in the combination is being altered.
For a good answer, would computing the list of combinations all at once be acceptable, or do you need to compute them one at a time? In other words, do you need a function:
Combination nextCombo();
or would
vector<Combination> allCombinations();
be acceptable?
If computing the combinations in batch is permissible, it is possible that an iterative-deepening A* search (or just an A* search but I suspect it'd run out of memory) would work. With an admissible heuristic, A* is guaranteed to give the optimum. I'm short of time, so I decided to post this partial answer and edit the post if I get time to write code.
A* is a graph search algorithm. In this case, the nodes are lists of combinations used so far (no duplicates allowed in the list). My plan was to use a bit-string representation for the nodes. n=30 would fit into a 32 bit integer. We can arbitrarily permute any solution so that the first combination begins with 0's and ends in 1's, i.e. 000...1111. A node with a shorter list is connected to a longer one if the two lists are the same up until the last element and the last element differs only in having one 0'bit flipped to a 1 and one 1 bit flipped to a 0. The distance between the two is 0 if there was a "swap" and 1 if there was a swap.
A second representation for each combination is a sorted list of the bits that are turned on. One possible admissible heuristic for this graph is to use this index list. Each time (in the list of combinations) an index is used at a particular position in the index list, mark it off. The number of positions with un-used indices - the current last changed index is (I believe) the minimal number of swaps that need to happen.
To illustrate this heuristic, consider the sequence abc abd abe* abf* abg afg from above. (the letters would be numbers in my treatment, but that is a minor difference). This sequence (which would be one node in the search-graph) would have the following places marked:
1 2 3
*a
b *b
c c *c
d d *d
e e *e
*f *f
*g
Thus the heuristic would predict that there is at least one swap required (since there are no unmarked elements in position 3 and the current position is 2).
If I get the time, I'll edit this to post code and performance of the algorithm.
Re: the NP completeness result (in a comment to Zac Thompson's answer). The graph on which we are searching for a minimal cost Hamiltonian path has a very special structure. For example, the normally NP-complete Hamiltonian Path problem can be solved in O(n) time with the "enumerate all combinations" algorithm - with n being the number of nodes in the graph. This structure makes it possible that, though on a general graph, vertex cover is hard, on your graph it may be polynomial (even linear or quadratic). Of course, since the graph has a lot of nodes for e.g. n=30, k=8 you may still have a lot of computation ahead of you.
I worked on this problem in 2010 but failed to find a solution then. A few days ago I had another look at some of my notes from that time and suspected I had been very close to a solution. A few minutes later I had the key.
To recapitalute: the requirement is a strict minimal change ordering of the k-subsets of a string s such that LIFO (last in first out) is maximised. I refer to this as maximised ‘swapping’ in earlier posts.
I call the algorithm maxlifo (maximised LIFO) after the key requirement. It takes two parameters, a string s, which must not contain duplicated characters, and a positive integer k not greater than the size of s. The algorithm is recursive, i.e. maxlifo(s, k) uses the output of maxlifo(s, k-1) down to k=1. Output is returned as a list.
Below I give an informal explanation, with examples, using the string "abcdefg" and various values of k. This is followed by an example of implementation as a Unicon procedure. (I’m not fluent in any of the more commonly used languages.)
The case k=1 is trivial — it returns the elements of s in order from first to last.
For k>1, apply the following rules to the output of maxlifo(s, k-1):
(1) For each element of the output of maxlifo(s, k-1), list in a row the k-subsets built from that element with each missing character of s in turn. The order of characters in the subsets is as in s.
(2) Working from the second row down, substitute blank ‘placeholders’ for all occurrences of subsets that appear in an earlier row. Each k-subset of s now appears just once.
(3) In each non-blank row, mark with an initial ! each subset such that there is a placeholder at the same position in the next row. This marking means ‘first’. Exactly one subset will be so marked in each non-blank row.
(4) Delete all rows that are completely blank (contain only placeholders).
(5) In each remaining row except the last, mark with a final ! the subset in the position corresponding to the subset marked ‘first’ in the next lower row. This marking means ‘last’.
Now we can list the final maxlifo ordering of the subsets. Each row from top to bottom is ordered and its elements added in that order to the output list.
(6) In each row from the top down:
(6.1) Remove all blank placeholders.
(6.2) Add to the output list the subset marked ‘first’ (initial !) and remove it from the row.
(6.3) If there are still subsets remaining in the row, either the leftmost or the rightmost subset will be marked ‘last’ (final !). If the rightmost subset is marked ‘last’, add the subsets to the output list in order from left to right, otherwise in order from right to left.
(7) After processing all rows, return the output list.
Example using maxlifo("abcdefg", 2):
Col1 contains the output of maxlifo("abcdefg", 1). The rows of Col2 contain the cliques formed with the remaining characters of s:
Col1 Col2
---- ----------------------------
a ab ac ad ae af ag
b ab bc bd be bf bg
c ac bc cd ce cf cg
d ad bd cd de df dg
e ae be ce de ef eg
f af bf cf df ef fg
g ag bg cg dg eg fg
Blank out subsets that appear in an earlier row:
a ab ac ad ae af ag
b bc bd be bf bg
c cd ce cf cg
d de df dg
e ef eg
f fg
g
Mark the ‘first’ subset in each row (the one with a blank below it):
a !ab ac ad ae af ag
b !bc bd be bf bg
c !cd ce cf cg
d !de df dg
e !ef eg
f !fg
g
Delete all completely blank rows (only one in this case):
a !ab ac ad ae af ag
b !bc bd be bf bg
c !cd ce cf cg
d !de df dg
e !ef eg
f !fg
Mark the ’last’ subset in each row (the one with a ‘first’ subset below it).
a !ab ac! ad ae af ag
b !bc bd! be bf bg
c !cd ce! cf cg
d !de df! dg
e !ef eg!
f !fg
Output each row in the order described above: ‘first’, unmarked, ’last’:
Ordered rows:
a !ab ac! ad ae af ag ab ag af ae ad ac
b !bc bd! be bf bg bc bg bf be bd
c !cd ce! cf cg cd cg cf ce
d !de df! dg de dg df
e !ef eg! ef eg
f !fg fg
Output: [ab ag af ae ad ac bc bg bf be bd cd cg cf ce df dg de ef eg fg]
Examples for 3 <= k <= 6 are given in less detail. The blank rows deleted in step 4 are left in place.
maxlifo("abcdefg", 3):
Ordered rows:
ab !abc abd abe abf abg! abc abd abe abf abg
ag acg adg aeg! !afg afg acg adg aeg
af acf adf! !aef aef acf adf
ae ace! !ade ade ace
ad !acd! acd
ac
bc !bcd bce bcf bcg! bcd bce bcf bcg
bg bdg beg! !bfg bfg bdg beg
bf bdf! !bef bef bdf
be !bde! bde
bd
cd !cde cdf cdg! cde cdf cdg
cg ceg! !cfg cfg ceg
cf !cef! cef
ce
de !def deg! def deg
dg !dfg! dfg
df
ef !efg efg
eg
fg
Output: [abc abd abe abf abg afg acg adg aeg aef acf adf ade ace acd
bcd bce bcf bcg bfg bdg beg bef bdf bde
cde cdf cdg cfg ceg cef
def deg dfg
efg]
maxlifo("abcdefg", 4):
Ordered rows:
abc !abcd abce! abcf abcg abcd abcg abcf abce
abd !abde abdf! abdg abde abdg abdf
abe !abef abeg! abef abeg
abf !abfg! abfg
abg
afg acfg! adfg !aefg aefg adfg acfg
acg !acdg aceg! acdg aceg
adg !adeg! adeg
aeg
aef acef! !adef adef acef
acf !acdf! acdf
adf
ade !acde! acde
ace
acd
bcd !bcde bcdf! bcdg bcde bcdg bcdf
bce !bcef bceg! bcef bceg
bcf !bcfg! bcfg
bcg
bfg bdfg! !befg befg bdfg
bdg !bdeg! bdeg
beg
bef !bdef! bdef
bdf
bde
cde !cdef cdeg! cdef cdeg
cdf !cdfg! cdfg
cdg
cfg !cefg! cefg
ceg
cef
def !defg defg
deg
dfg
efg
Output: [abcd abcg abcf abce abde abdg abdf abef abeg abfg aefg adfg acfg acdg aceg adeg adef acef acdf acde
bcde bcdg bcdf bcef bceg bcfg befg bdfg bdeg bdef
cdef cdeg cdfg cefg
defg]
maxlifo("abcdefg", 5):
Ordered rows:
abcd !abcde abcdf abcdg! abcde abcdf abcdg
abcg abceg! !abcfg abcfg abceg
abcf !abcef! abcef
abce
abde !abdef abdeg! abdef abdeg
abdg !abdfg! abdfg
abdf
abef !abefg! abefg
abeg
abfg
aefg acefg! !adefg adefg acefg
adfg !acdfg! acdfg
acfg
acdg !acdeg! acdeg
aceg
adeg
adef !acdef! acdef
acef
acdf
acde
bcde !bcdef bcdeg! bcdef bcdeg
bcdg !bcdfg! bcdfg
bcdf
bcef !bcefg! bcefg
bceg
bcfg
befg !bdefg! bdefg
bdfg
bdeg
bdef
cdef !cdefg cdefg
cdeg
cdfg
cefg
defg
Output: [abcde abcdf abcdg abcfg abceg abcef abdef abdeg abdfg abefg adefg acefg acdfg acdeg acdef
bcdef bcdeg bcdfg bcefg bdefg
cdefg]
maxlifo("abcdefg", 6):
Ordered rows:
abcde !abcdef abcdeg! abcdef abcdeg
abcdf !abcdfg! abcdfg
abcdg
abcfg !abcefg! abcefg
abceg
abcef
abdef !abdefg! abdefg
abdeg
abdfg
abefg
adefg
acefg !acdefg! acdefg
acdfg
acdeg
acdef
bcdef !bcdefg bcdefg
bcdeg
bcdfg
bcefg
bdefg
cdefg
Output: [abcdef abcdeg abcdfg abcefg abdefg acdefg bcdefg]
Unicon implementation:
procedure maxlifo(s:string, k:integer)
# A solution to my combinatorics problem from 2010.
# Return a list of the k subsets of the characters of a string s
# in a minimal change order such that last-in first-out is maximised.
# String s must not contain duplicate characters and in the present
# implementation must not contain "!", which is used as a marker.
local ch, cand, Hit, inps, i, j, K, L, Outp, R, S
# Errors
if *cset(s) ~= *s then
stop("Duplicate characters in set in maxlifo(", s, ", ", k, ")")
if find("!", s) then
stop("Illegal character in set in maxlifo(", s, ", ", k, ")")
if k > *s then
stop("Subset size larger than set size in maxlifo(", s, ", ", k, ")")
# Special cases
if k = 0 then return []
if k = *s then return [s]
Outp := []
if k = 1 then {
every put(Outp, !s)
return Outp
}
# Default case
S := set()
K := []
# Build cliques from output of maxlifo(s, k-1) with the remaining
# characters in s, substituting empty strings as placeholders for
# subsets already listed.
every inps := !maxlifo(s, k-1) do {
R := []
every ch := !s do
if not find(ch, inps) then {
cand := reorder(inps ++ ch, s)
if member(S, cand) then cand := "" else insert(S, cand)
put(R, cand)
}
put(K, R)
}
# Mark ‘first’ subset in each row with initial "!"
every i := 1 to *K - 1 do {
every j := 1 to *K[i] do
if K[i, j] ~== "" & K[i+1, j] == "" then {
K[i, j] := "!" || K[i, j]
break
}
}
# Remove rows containing only placeholders
every i := *K to 1 by -1 do {
every if !K[i] ~== "" then break next
delete(K, i)
}
# Mark ‘last’ subset in each row with final "!"
every i := 1 to *K - 1 do
every j := 1 to *K[i] do
if K[i+1, j][1] == "!" then {
K[i, j] ||:= "!"
break
}
# Build output list
every R := !K do {
# Delete placeholders from row (no longer needed and in the way)
every j := *R to 1 by -1 do if R[j] == "" then delete(R, j)
# Handle ‘first’ subset and remove from row
# N.B. ‘First’ subset will be leftmost or rightmost in row
if R[1][1] == "!" then
put(Outp, trim(get(R), '!', 0))
else put(Outp, trim(pull(R), '!', 0))
# Handle any remaining subsets, ‘last’ subset last, stripping '!' markers
# N.B. ‘Last’ subset will be leftmost or rightmost in row after removal
# of ‘first’ subset.
if R[-1][-1] == "!" then while put(Outp, trim(get(R), '!', 0)) else
while put(Outp, trim(pull(R), '!', 0))
}
return Outp
end
procedure reorder(cs:cset, s:string)
# Reorder cset cs according to string s
local r
# If no s, return set in alphabetical order
if /s then return string(cs)
r := ""
s ? while tab(upto(cs)) do r ||:= move(1)
return r
end

Resources