How are codons read in Grammatical Evolution: are they a sliding window or are they done in chunks? - genetic-algorithm

I'm reading the paper by O'Neill and Ryan and they say this:
each codon represents an integer value where codons are consecutive
groups of 8 bits
So I just want to make sure, two successive codons would be bits 1-8, then 9-16 right? Or would they be bits 1-8, 2-9, 3-10 etc?

I have read "Grammatical Swarm: The generation of programs by social programming" (M. O’Neill, A. Brabazon, 2006). In this paper the authors are quite clear about the genome being represented as an array of integer numbers (8-bit, unsigned) representing codons. A codon with value c is mapped to an expansion rule of a symbol as Rule = c % r, where r is the number of choices for the symbol.
They also describe a genome wrapping operator to make a GE genome easier fit particle swarm optimization method, but this does not seem to be relevant to the original question and does not change to genome representation.
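So yes: two successive codons are bits 1-8 and then bits 9-16; the genome is read in consecutive, non-overlapping 8-bit chunks, not with a sliding window. Here is a minimal sketch of that decoding and of the modulo mapping, assuming a toy grammar where the current symbol has r = 4 choices (the bit string is just a placeholder):

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    // A binary genome; its length is assumed to be a multiple of 8.
    std::vector<int> bits = {0,1,1,0,1,0,0,1,  1,1,0,0,0,1,0,1};

    // Decode codons as consecutive, non-overlapping groups of 8 bits.
    std::vector<std::uint8_t> codons;
    for (std::size_t i = 0; i + 8 <= bits.size(); i += 8) {
        std::uint8_t c = 0;
        for (std::size_t j = 0; j < 8; ++j)
            c = static_cast<std::uint8_t>((c << 1) | bits[i + j]);
        codons.push_back(c);
    }

    // Map each codon to a production rule with Rule = c % r.
    const int r = 4;   // number of choices for the current non-terminal (assumed)
    for (std::uint8_t c : codons)
        std::cout << "codon " << int(c) << " -> rule " << int(c) % r << "\n";
}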

Related

How to quickly find closest matrix

I have an MxN matrix, A, and a set of matrices with the same number of columns, but with varying number of rows, {B1, B2..., Bn}, all using floating-point values. I would like to find the B matrix closest to A as fast as possible.
A trivial but slow implementation could use matrix-wise least-squares, but if the number of matrices B is very high, that becomes a problem for me (two arithmetic functions per element per matrix).
I've read somewhere that Youtube's audio fingerprinting algorithm was using a finite state transducer, but I'm not sure how I would implement that, or if it's overkill (i.e. time-consuming to write).
For example, since the number of columns is the same in A and B, should I try to generate a "letter" from each row of A, creating a "string" of M letters (one for each of the M rows of A) and then do fuzzy string searching through the B set? How then to make sure the number of letters is small enough? What is a good way to fuzzy-match such rows (is "bbbd" closer to "acae" or "dddg")? I guess this is where the incomprehensible transducing comes in...

Repeated DNA sequence

The problem is to find all the sequences of length k in a given DNA sequence which occur more than once. I found an approach that uses a rolling hash function: for each sequence of length k, a hash is computed and stored in a map. To check whether the current sequence is a repetition, we compute its hash and check whether that hash already exists in the hash map. If yes, we include this sequence in our result; otherwise we add it to the hash map.
Rolling hash here means that when moving on to the next sequence by sliding the window by one, we reuse the hash of the previous sequence: we remove the contribution of the first character of the previous sequence and add the contribution of the newly added character, i.e. the last character of the new sequence.
Input: AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT
and k=10
Answer: {AAAAACCCCC, CCCCCAAAAA}
This approach looks good, but I can't work out how to build a perfect hash function so that collisions are avoided. It would be a great help if somebody could explain how to make a perfect hash under any circumstance, and most importantly in this case.
This is actually a research problem.
Let's come to terms with some facts:
Input = N, Input length = |N|
You have to move a sliding window of size k (here k=10) over the input. Therefore you must live with O(|N|) or more.
Your rolling hash is a form of locality-sensitive, deterministic hashing. The downside of deterministic hashing is that its benefit is greatly diminished: the more often you encounter similar strings, the harder they are to hash.
The longer your input, the less effective hashing will be.
Given these facts, "rolling hashes" will soon fail. You cannot design a rolling hash that will even work for 1/10th of a chromosome.
So what alternatives do you have?
Bloom filters. They are much more robust than simple hashing. The downside is that they sometimes give false positives, but this can be mitigated by using several filters.
Cuckoo hashing is similar to Bloom filters, but uses less memory, behaves like locality-sensitive "hashing", and has worst-case constant lookup time.
Just stick every suffix in a suffix trie. Once this is done, just output every string at depth 10 that also has at least 2 children, with one of the children being a leaf.
Improve on the suffix trie with a suffix tree. Lookup is not as straightforward, but memory consumption is lower.
My favorite: the FM-index. In my opinion the cleanest solution uses the Burrows-Wheeler Transform. This technique is also used in industry tools like Bowtie and BWA.
Heads-up: This is not a general solution, but a good trick that you can use when k is not large.
The trick is to encode the sequence into an integer by bit manipulation.
If your k is relatively small, say around 10, then you can encode your DNA sequence in an int via bit manipulation. Since each character in the sequence has only 4 possibilities, A, C, G, T, you can simply define your own mapping which uses 2 bits to represent a letter.
For example: 00 -> A, 01 -> C, 10 -> G, 11 -> T.
In this way, if k is 10, you won't need a string of 10 characters as the hash key. Instead, you need only 20 bits of an integer to represent the key string.
Then, when you roll the hash, you left-shift the integer that stores the previous sequence by 2 bits, and use a bit operation like |= to set the last two bits to the new character. Also remember to clear the 2 bits that were shifted out past the 2k-bit window, which removes the character that just left the sliding window.
By doing this, a string can be stored in an integer, and using that integer as the hash key is nicer and cheaper in terms of the complexity of the hash-function computation. If your input length k is somewhat larger than 16, you can use a 64-bit long value instead (up to k = 32). Beyond that, you might use a bitset or a bit array, but hashing those becomes another issue.
Therefore, I'd say this is a nice approach to the problem when the sequence length is relatively small, i.e. when it can be stored in a single integer or long integer.
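A minimal sketch of that trick, with the duplicate reporting done through two hash sets (the fixed k and the input string are just the example from the question):

#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

// Encode each base in 2 bits: A=00, C=01, G=10, T=11.
static int baseCode(char c) {
    switch (c) {
        case 'A': return 0; case 'C': return 1;
        case 'G': return 2; default:  return 3;  // 'T'
    }
}

// Report every length-k substring that occurs more than once,
// using the 2k-bit integer encoding as the hash key.
std::vector<std::string> repeatedSequences(const std::string& s, int k) {
    std::vector<std::string> result;
    if ((int)s.size() < k) return result;

    const unsigned mask = (1u << (2 * k)) - 1;   // keeps exactly 2k bits (k <= 15 here)
    unsigned window = 0;
    std::unordered_set<unsigned> seen, reported;

    for (int i = 0; i < (int)s.size(); ++i) {
        window = ((window << 2) | baseCode(s[i])) & mask;   // slide the window by one base
        if (i >= k - 1) {
            if (!seen.insert(window).second && reported.insert(window).second)
                result.push_back(s.substr(i - k + 1, k));
        }
    }
    return result;
}

int main() {
    for (const auto& seq : repeatedSequences("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", 10))
        std::cout << seq << "\n";   // prints AAAAACCCCC and CCCCCAAAAA
}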
You can build the suffix array and the LCP array. Iterate through the LCP array; every time you see a value greater than or equal to k, report the string referred to by that position (using the suffix array to determine where the substring comes from).
After you report a substring because the LCP was greater than or equal to k, skip all following values until you reach one that is less than k (this avoids reporting repeated values).
The construction of both the suffix array and the LCP array can be done in linear time, so overall the solution is linear with respect to the size of the input plus output.
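For illustration, here is a deliberately naive sketch of that idea: the suffix array is built by sorting suffixes directly (O(n^2 log n)) and the LCP array by direct comparison, whereas a real implementation would use the linear-time constructions mentioned above:

#include <algorithm>
#include <iostream>
#include <numeric>
#include <string>
#include <vector>

int main() {
    const std::string s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT";
    const int k = 10, n = (int)s.size();

    // Suffix array by direct sorting of suffixes (naive; linear-time methods exist).
    std::vector<int> sa(n);
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(),
              [&](int a, int b) { return s.substr(a) < s.substr(b); });

    // lcp[i] = longest common prefix of the suffixes sa[i-1] and sa[i] (naive).
    std::vector<int> lcp(n, 0);
    for (int i = 1; i < n; ++i) {
        int a = sa[i - 1], b = sa[i], len = 0;
        while (a + len < n && b + len < n && s[a + len] == s[b + len]) ++len;
        lcp[i] = len;
    }

    // Report once per run of LCP values >= k, as described above.
    for (int i = 1; i < n; ++i)
        if (lcp[i] >= k && lcp[i - 1] < k)
            std::cout << s.substr(sa[i], k) << "\n";    // prints AAAAACCCCC and CCCCCAAAAA
}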
What you could do is use the Chinese Remainder Theorem and pick several large prime moduli. If you recall, CRT means that a system of congruences with coprime moduli has a unique solution mod the product of all your moduli. So if you have three moduli 10^6+3, 10^6+33, and 10^6+37, then in effect you have a modulus of size 10^18, more or less. With a sufficiently large modulus, you can more or less disregard the idea of a collision happening at all: as my instructor so beautifully put it, it's more likely that your computer will spontaneously catch fire than for a collision to happen, since you can drive the collision probability to be arbitrarily small.
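A sketch of that multi-modulus idea with a polynomial rolling hash; the base B = 131 is an illustrative choice, and the tuple of residues plays the role of the single large modulus:

#include <array>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

int main() {
    const std::string s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT";
    const int k = 10, n = (int)s.size();
    const std::int64_t B = 131;                                 // base (assumed)
    const std::array<std::int64_t, 3> M = {1000003, 1000033, 1000037};

    std::array<std::int64_t, 3> h = {0, 0, 0}, pw = {1, 1, 1};
    for (int j = 0; j < 3; ++j)
        for (int i = 0; i < k - 1; ++i) pw[j] = pw[j] * B % M[j];   // B^(k-1) mod M[j]

    std::map<std::array<std::int64_t, 3>, int> firstSeen;           // residue tuple -> first position
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < 3; ++j) {
            if (i >= k)   // drop the character leaving the window
                h[j] = (h[j] - s[i - k] * pw[j] % M[j] + M[j]) % M[j];
            h[j] = (h[j] * B + s[i]) % M[j];
        }
        if (i >= k - 1) {
            auto ins = firstSeen.emplace(h, i - k + 1);
            if (!ins.second)
                std::cout << s.substr(i - k + 1, k) << " at " << i - k + 1
                          << " repeats position " << ins.first->second << "\n";
        }
    }
}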

Is it possible to create an algorithm which generates an autogram?

An autogram is a sentence which describes the characters it contains, usually enumerating each letter of the alphabet, but possibly also the punctuation it contains. Here is the example given in the wiki page.
This sentence employs two a’s, two c’s, two d’s, twenty-eight e’s, five f’s, three g’s, eight h’s, eleven i’s, three l’s, two m’s, thirteen n’s, nine o’s, two p’s, five r’s, twenty-five s’s, twenty-three t’s, six v’s, ten w’s, two x’s, five y’s, and one z.
Coming up with one is hard, because you don't know how many letters it contains until you finish the sentence. Which is what prompts me to ask: is it possible to write an algorithm which could create an autogram? For example, a given parameter would be the start of the sentence as an input e.g. "This sentence employs", and assuming that it uses the same format as the above "x a's, ... y z's".
I'm not asking for you to actually write an algorithm, although by all means I'd love to see if you know one to exist or want to try and write one; rather I'm curious as to whether the problem is computable in the first place.
You are asking two different questions.
"is it possible to write an algorithm which could create an autogram?"
There are algorithms to find autograms. As far as I know, they use randomization, which means that such an algorithm might find a solution for a given start text, but if it doesn't find one, then this doesn't mean that there isn't one. This takes us to the second question.
"I'm curious as to whether the problem is computable in the first place."
Computable would mean that there is an algorithm which for a given start text either outputs a solution, or states that there isn't one. The above-mentioned algorithms can't do that, and an exhaustive search is not workable. Therefore I'd say that this problem is not computable. However, this is rather of academic interest. In practice, the randomized algorithms work well enough.
Let's assume for the moment that all counts are less than or equal to some maximum M, with M < 100. As mentioned in the OP's link, this means that we only need to decide counts for the 16 letters that appear in these number words, as counts for the other 10 letters are already determined by the specified prefix text and can't change.
One property that I think is worth exploiting is the fact that, if we take some (possibly incorrect) solution and rearrange the number-words in it, then the total letter counts don't change. IOW, if we ignore the letters spent "naming themselves" (e.g. the c in two c's) then the total letter counts only depend on the multiset of number-words that are actually present in the sentence. What that means is that instead of having to consider all possible ways of assigning one of M number-words to each of the 16 letters, we can enumerate just the (much smaller) set of all multisets of number-words of size 16 or less, having elements taken from the ground set of number-words of size M, and for each multiset, look to see whether we can fit the 16 letters to its elements in a way that uses each multiset element exactly once.
Note that a multiset of numbers can be uniquely represented as a nondecreasing list of numbers, and this makes them easy to enumerate.
What does it mean for a letter to "fit" a multiset? Suppose we have a multiset W of number-words; this determines total letter counts for each of the 16 letters (for each letter, just sum the counts of that letter across all the number-words in W; also add a count of 1 for the letter "S" for each number-word besides "one", to account for the pluralisation). Call these letter counts f["A"] for the frequency of "A", etc. Pretend we have a function etoi() that operates like C's atoi(), but returns the numeric value of a number-word. (This is just conceptual; of course in practice we would always generate the number-word from the integer value (which we would keep around), and never the other way around.) Then a letter x fits a particular number-word w in W if and only if f[x] + 1 = etoi(w), since writing the letter x itself into the sentence will increase its frequency by 1, thereby making the two sides of the equation equal.
This does not yet address the fact that if more than one letter fits a number-word, only one of them can be assigned it. But it turns out that it is easy to determine whether a given multiset W of number-words, represented as a nondecreasing list of integers, simultaneously fits any set of letters:
Calculate the total letter frequencies f[] that W implies.
Sort these frequencies.
Skip past any zero-frequency letters. Suppose there were k of these.
For each remaining letter, check whether its frequency is equal to one less than the numeric value of the number-word in the corresponding position. I.e. check that f[k] + 1 == etoi(W[0]), f[k+1] + 1 == etoi(W[1]), etc.
If and only if all these frequencies agree, we have a winner!
The above approach is naive in that it assumes that we choose words to put in the multiset from a size M ground set. For M > 20 there is a lot of structure in this set that can be exploited, at the cost of slightly complicating the algorithm. In particular, instead of enumerating straight multisets of this ground set of all allowed numbers, it would be much better to enumerate multisets of {"one", "two", ..., "nineteen", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"}, and then allow the "fit detection" step to combine the number-words for multiples of 10 with the single-digit number-words.
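Here is a rough sketch of the fit test from the numbered steps above. It is a simplification: it treats all 26 letters uniformly (rather than fixing 10 of them via the prefix), assumes counts below 100, and the extra size check is an added guard; W is the nondecreasing list of counts, so etoi(W[i]) is just W[i]:

#include <algorithm>
#include <array>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// English words for 1..99, which is enough under the assumption M < 100.
std::string numberWord(int n) {
    static const char* ones[] = {"", "one", "two", "three", "four", "five", "six",
        "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"};
    static const char* tens[] = {"", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"};
    if (n < 20) return ones[n];
    return std::string(tens[n / 10]) + (n % 10 ? std::string("-") + ones[n % 10] : "");
}

// Does the multiset W (a nondecreasing list of counts) simultaneously fit a set
// of letters, given the letter counts f contributed by the fixed prefix text?
bool fits(const std::vector<int>& W, std::array<int, 26> f) {
    for (int w : W) {
        for (char c : numberWord(w))            // letters spent writing the number-word
            if (c >= 'a' && c <= 'z') ++f[c - 'a'];
        if (w != 1) ++f['s' - 'a'];             // pluralisation: "two c's", "five f's", ...
    }
    std::vector<int> freq(f.begin(), f.end());
    std::sort(freq.begin(), freq.end());                // step 2: sort the frequencies
    std::size_t k = 0;
    while (k < freq.size() && freq[k] == 0) ++k;        // step 3: skip zero-frequency letters
    if (freq.size() - k != W.size()) return false;      // added guard: one word per counted letter
    for (std::size_t i = 0; i < W.size(); ++i)          // step 4: f[k+i] + 1 == etoi(W[i])
        if (freq[k + i] + 1 != W[i]) return false;
    return true;                                        // step 5: we have a winner
}

int main() {
    std::array<int, 26> prefixCounts{};     // letter counts of the fixed prefix text (assumed known)
    std::vector<int> W = {2, 2, 3};         // a candidate multiset of counts, nondecreasing
    std::cout << (fits(W, prefixCounts) ? "fits" : "does not fit") << "\n";
}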

Minimal Difference Patch Algorithm

I'm trying to convey the difference between two bytestreams. I want to minimize the number of bytes in the patch.
(I don't necessarily want to minimize the number of "changes" in the diff, which is what the optimal patch in a levenshtein distance computation would give me.)
The patch would ideally be in a format such that, given the source bytestream and the diff, it would be easy to reconstruct the target bytestream.
Is there a good algorithm for doing this?
Edit: For the record, I've tried sending changes of the form "at spot 506, insert the following bytes...", where I create a change list from the Levenshtein distance algorithm.
The problem I have is that the levenshtein distance algorithm gives me a lot of changes like:
at spot 506 substitute [some bytes1]
at spot 507 do nothing
at spot 508 substitute [some bytes2]
at spot 509 do nothing
at spot 510 substitute [some bytes3]
...
This is because the lev distance algorithm tries to minimize the number of changes. However, for my purposes this instruction set is wasteful. It would probably be better if an algorithm just said,
At spot 506 substitute [some bytes1, [byte at spot 507], some bytes2, [byte at spot 509], some bytes3, ...]
There's probably some way to modify lev distance to favor these types of changes but it seems a little tricky. I could coalesce substitutions after getting a changelist (and I'm going to try that) but there may be opportunities to coalesce deletions / inserts too, and it's less obvious how to do that correctly.
Just wondering if there's a special purpose algorithm for this (or if somebody's done a modification of lev distance to favor these types of changes already).
You can do this using pairwise alignment with affine gap costs, which takes O(nm) time for two strings of lengths n and m respectively.
One thing first: There is no way to find a provably minimal patch in terms of bits or bytes used. That's because if there was such a way, then the function shortest_patch(x, y) that calculates it could be used to find a provably minimal compression of any given string s by calling it with shortest_patch('', s), and Kolmogorov complexity tells us that the shortest possible compression of a given string is formally uncomputable. But if edits tend to be clustered in space, as it seems they are here, then it's certainly possible to find smaller patches than those produced using the usual Levenshtein distance algorithm.
Edit scripts
Patches are usually called "edit scripts" in CS. Finding a minimal (in terms of number of insertions plus number of deletions) edit script for turning one string x into another string y is equivalent to finding an optimal pairwise alignment in which every pair of equal characters has value 0, every pair of unequal characters has value -inf, and every position in which a character from one string is aligned with a - gap character has value -1. Alignments are easy to visualise:
st--ing st-i-ng
stro-ng str-ong
These are 2 optimal alignments of the strings sting and strong, each having cost -3 under the model. If pairs of unequal characters are given the value -1 instead of -inf, then we get an alignment with cost equal to the Levenshtein distance (the number of insertions, plus the number of deletions, plus the number of substitutions):
st-ing sti-ng
strong strong
These are 2 optimal alignments under the new model, and each has cost -2.
To see how these correspond with edit scripts, we can regard the top string as the "original" string, and the bottom string as the "target" string. Columns containing pairs of unequal characters correspond to substitutions, the columns containing a - in the top row correspond to insertions of characters, and the columns containing a - in the bottom row correspond to deletions of characters. You can create an edit script from an alignment by using the "instructions" (C)opy, (D)elete, (I)nsert and (S)ubstitute. Each instruction is followed by a number indicating the number of columns to consume from the alignment, and in the case of I and S, a corresponding number of characters to insert or replace with. For example, the edit scripts for the previous 2 alignments are
C2, I1"r", S1"o", C2 and C2, S1"r", I1"o", C2
Increasing bunching
Now if we have strings like mississippi and tip, we find that the two alignments
mississippi
------tip--
mississippi
t---i----p-
both have the same score of -9: they both require the same total number of insertions, deletions and substitutions. But we much prefer the top one, because its edit script can be described much more succinctly: D6, S1"t", C2, D2. The second's edit script would be S1"t", D3, C1, D4, C1, D1.
In order to get the alignment algorithm to also "prefer" the first alignment, we can adjust gap costs so that starting a block of gaps costs more than continuing an existing block of gaps. If we make it so that a column containing a gap costs -2 instead of -1 when the preceding column contains no gap, then what we are effectively doing is penalising the number of contiguous blocks of gaps (since each contiguous block of gaps must obviously have a first position). Under this model, the first alignment above now costs -11, because it contains two contiguous blocks of gaps. The second alignment now costs -12, because it contains three contiguous blocks of gaps. IOW, the algorithm now prefers the first alignment.
This model, in which every aligned position containing a gap costs g and the first position in any contiguous block of gap columns costs g + s, is called the affine gap cost model, and an O(nm) algorithm was given for this by Gotoh in 1982: http://www.genome.ist.i.kyoto-u.ac.jp/~aln_user/archive/JMB82.pdf. Increasing the gap-open cost s will cause aligned segments to bunch together. You can play with the various cost parameters until you get alignments (corresponding to patches) that empirically look about right and are small enough.
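A minimal sketch of that DP (Gotoh's recurrences in cost-minimisation form, with mismatch 1, gap-open 2 and gap-extend 1 so the costs mirror the scores in the example; a real patch generator would also keep traceback pointers to recover the alignment and hence the edit script):

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Affine-gap alignment cost (Gotoh 1982), minimisation form.
// M[i][j]: best cost ending in an aligned pair; X/Y: ending in a gap in y / in x.
int affineGapCost(const std::string& x, const std::string& y,
                  int mismatch = 1, int gapOpen = 2, int gapExtend = 1) {
    const int n = (int)x.size(), m = (int)y.size(), INF = 1 << 29;
    std::vector<std::vector<int>> M(n + 1, std::vector<int>(m + 1, INF)), X = M, Y = M;
    M[0][0] = 0;
    for (int i = 1; i <= n; ++i) X[i][0] = gapOpen + (i - 1) * gapExtend;
    for (int j = 1; j <= m; ++j) Y[0][j] = gapOpen + (j - 1) * gapExtend;

    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= m; ++j) {
            int sub = (x[i - 1] == y[j - 1]) ? 0 : mismatch;
            M[i][j] = sub + std::min({M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]});
            X[i][j] = std::min({M[i - 1][j] + gapOpen,      // open a new gap block
                                Y[i - 1][j] + gapOpen,
                                X[i - 1][j] + gapExtend});  // extend the current block
            Y[i][j] = std::min({M[i][j - 1] + gapOpen,
                                X[i][j - 1] + gapOpen,
                                Y[i][j - 1] + gapExtend});
        }
    return std::min({M[n][m], X[n][m], Y[n][m]});
}

int main() {
    // Prints 11, matching the -11 score of the preferred alignment above.
    std::cout << affineGapCost("mississippi", "tip") << "\n";
}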
There are two approaches to solving this kind of problem:
1) Establish a language for X (edit scripts, in this case), and figure out how to minimize the length of the applicable sentence; or,
2) Compute some kind of minimum representation for Y (string differences), and then think up a way to represent that in the shortest form.
The Myers paper demonstrates that for a particular language, finding the minimum set of changes and finding the minimum length of the change representation are the same problem.
Obviously, changing the language might invalidate that assumption, and certain changes might be extremely complicated to apply correctly (for example, suppose the language included the primitive kP, which means "remove the next k characters whose indices are prime"; for certain diffs, using that primitive might turn out to be a huge win, but the applications are probably pretty rare). It's an absurd example, I know, but it demonstrates the difficulty of starting with a language.
So I propose starting with the minimum change list, which identifies inserts and deletes. We translate that in a straightforward way to a string of commands, of which there are exactly three. There are no indices here. The idea is that we start with a cursor at the beginning of the original string, and then execute the commands in sequence. The commands are:
= Advance the cursor without altering the character it points to
Ic Insert the character `c` before the cursor.
D Delete the character at the cursor.
Although I said there were exactly three commands, that's not quite true; there are actually A+2 where A is the size of the alphabet.
This might result in a string like this:
=========================IbIaInIaInIaDD=D=D============================
Now, let's try to compress this. First, we run-length encode (RLE), so that every command is preceded by a repeat count, and we drop the trailing =s
27=1Ib1Ia1In1Ia1In1Ia2D1=1D1=1D
(In effect, the RLE recreates indices, although they're relative instead of absolute).
Finally, we use zlib to compress the resulting string. I'm not going to do that here, but just to give some idea of the sort of compression it might come up with:
27=1Ib1Ia1In||2D1=1D|
______+| ____+
___<---+
(Trying to show the back-references. It's not very good ascii art, sorry.)
Lempel-Ziv is very good at finding and optimizing unexpected repetitions. In fact, we could have just used it instead of doing the intermediate RLE step, but experience shows that in cases where RLE is highly effective, it's better to LZ the RLE than the source. But it would be worth trying both ways to see what's better for your application.
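For what it's worth, here is a small sketch of the tokenise-and-RLE step on such a command string (the zlib pass is left out, and the input in main is just a made-up example):

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Run-length encode a command string made of '=', 'D' and 'I<char>' tokens,
// dropping the trailing run of '='.
std::string rleEncode(const std::string& cmds) {
    std::vector<std::string> tokens;
    for (std::size_t i = 0; i < cmds.size(); ++i) {
        if (cmds[i] == 'I') { tokens.push_back(cmds.substr(i, 2)); ++i; }
        else                  tokens.push_back(cmds.substr(i, 1));
    }
    while (!tokens.empty() && tokens.back() == "=") tokens.pop_back();  // drop trailing '='

    std::string out;
    for (std::size_t i = 0; i < tokens.size(); ) {
        std::size_t j = i;
        while (j < tokens.size() && tokens[j] == tokens[i]) ++j;        // length of this run
        out += std::to_string(j - i) + tokens[i];
        i = j;
    }
    return out;
}

int main() {
    std::cout << rleEncode("=====IbIa==DD=") << "\n";   // prints 5=1Ib1Ia2=2D
}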
A common approach to this that uses very few bytes (though not necessarily the theoretical optimal number of bytes) is the following:
Pad the bytes with some character (perhaps zero) until they have the same lengths.
XOR the two streams together. This will result in a byte stream that is zero everywhere the bytes are the same and nonzero otherwise.
Compress the XORed stream using any compression algorithm, perhaps something like LZW.
Assuming that the patch you have is a localized set of changes to a small part of the file, this will result in a very short patch, since the bulk of the file will be zeros, which can be efficiently compressed.
To apply the patch, you just decompress the XORed string and then XOR it with the byte stream to patch. This computes
Original XOR (Original XOR New) = (Original XOR Original) XOR New = New
Since XOR is associative and self-inverting.
Hope this helps!
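A sketch of the XOR and padding steps; the compression pass (zlib, LZW, ...) described above would be applied to the returned patch before sending, and undone before applying:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Steps 1-2: pad the shorter stream with zeros and XOR the two streams.
// The result is zero wherever the streams agree, so it compresses well.
Bytes makeXorPatch(const Bytes& original, const Bytes& target) {
    Bytes patch(std::max(original.size(), target.size()), 0);
    for (std::size_t i = 0; i < patch.size(); ++i) {
        std::uint8_t a = i < original.size() ? original[i] : 0;
        std::uint8_t b = i < target.size()   ? target[i]   : 0;
        patch[i] = a ^ b;
    }
    return patch;   // compress this (e.g. with zlib) before sending -- not shown
}

// Applying the patch is the same operation: original XOR patch = target.
Bytes applyXorPatch(const Bytes& original, const Bytes& patch) {
    Bytes out(patch.size(), 0);
    for (std::size_t i = 0; i < patch.size(); ++i)
        out[i] = (i < original.size() ? original[i] : 0) ^ patch[i];
    return out;
}

int main() {
    Bytes a = {'h', 'e', 'l', 'l', 'o'}, b = {'h', 'e', 'l', 'p', '!', '!'};
    Bytes patch = makeXorPatch(a, b);
    Bytes restored = applyXorPatch(a, patch);
    std::cout << (restored == b ? "round-trip ok" : "mismatch") << "\n";
}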
There is a promising new approach to change detection.
The sequence alignment problem is treated as an abstract model for change detection in collaborative text editing, designed to minimize the probability of merge conflicts. A new cost function is defined as the probability of intersection between the detected changes and a random string.
The result should be closer to patch-length minimization than other known approaches.
It avoids the known shortcomings of LCS and of other approaches.
A cubic-time algorithm has been proposed.
http://psta.psiras.ru/read/psta2015_1_3-10.pdf

Determine if two chess positions are equal

I'm currently debugging my transposition table for a chess variant engine where pieces can be placed (i.e. originally not on the board). I need to know how often I'm hitting key collisions. I'm saving the piece list in each table index, along with the usual hash data. My simple solution for determining whether two positions are equal fails on transpositions because I'm linearly comparing the two piece lists.
Please do not suggest that I store a board-centric representation instead of a piece-centric one. I have to store the piece list because of the unique nature of placeable and captured pieces. Pieces in those states effectively occupy an overlapping, position-less location. Please look at the description of how pieces are stored.
// [Piece List]
//
// Contents: The location of the pieces.
// Values 0-63 are board indexes; -2 is dead; -1 is placeable
// Structure: Black pieces are at indexes 0-15
// White pieces are at indexes 16-31
// Within each set of colors the pieces are arranged as following:
// 8 Pawns, 2 Knights, 2 Bishops, 2 Rooks, 1 Queen, 1 King
// Example: piece[15] = 6 means the black king is on board index 6
// piece[29] = -2 means the white rook is dead
char piece[32];
A transposition happens when pieces are moved in a different order, but the end result is the same board position. For example the following positions are equal:
1) first rook on A1; second rook on D7
2) first rook on D7; second rook on A1
The following is a non-optimised general algorithm; the inner loop is similar to another general problem, but with the added constraint that values in 0-63 occur only once (i.e. only one piece per square).
for each color:
    for each piece type:
        are all pieces in the same position, disregarding transpositions?
The following comparison does NOT work because of transpositions. What I need is a way to detect transpositions as equal and only report actually different positions.
bool operator==(const Position &b)
{
    for (int i = 0; i < 32; i++)
        if (piece[i] != b.piece[i])
            return false;
    return true;
}
Performance/memory is a consideration because the table gets over 100K hits (where keys are equal) per turn and a typical table has 1 million items. Hence, I'm looking for something faster than copying and sorting the lists.
There is a lot of research on computer chess, and creating a unique hash for a position is a well-known problem with a universal solution used by virtually every chess engine.
What you need to do is use Zobrist hashing to create a unique (not really unique, but we'll see later why this is not a problem in practice) key for each different position. The algorithm applied to chess is explained here.
When you start your program you create what we call Zobrist keys. These are 64-bit random integers, one for each piece/square pair. In C you would have a 2-dimensional array like this:
unsigned long long zobKeys[NUMBER_OF_PIECES][NUMBER_OF_SQUARES];
Each of these keys is initialized with a good random number generator (warning: the random number generators provided with gcc or VC++ are not good enough; use an implementation of the Mersenne Twister).
When the board is empty you arbitrarily set its hash key to 0. Then when you add a piece to the board, say a rook on A1, you also update the hash key by XORing the Zobrist key for a rook on A1 with the hash key of the board, like this (in C):
boardHash = boardHash ^ zobKeys[ROOK][A1];
If you later remove the rook from this square you need to reverse what you just did. Since a XOR can be reversed by applying it again, you can simply use the same command again when you remove the piece:
boardHash = boardHash ^ zobKeys[ROOK][A1];
If you move a piece, say the rook on A1 goes to B1, you need to do two XORs: one to remove the rook from A1 and one to add a rook on B1.
boardHash = boardHash ^ zobKeys[ROOK][A1] ^ zobKeys[ROOK][B1];
This way, every time you modify the board you also modify the hash. It is very efficient. You could also compute the hash from scratch each time by XORing the zobKeys corresponding to all pieces on the board. You will also need to XOR the position of the pawn that can be taken en passant and the castling rights of both sides. You do it the same way, by creating Zobrist keys for each possible value.
This algorithm does not guarantee that each position has a unique hash; however, if you use a good pseudo-random number generator, the odds of a collision occurring are so low that even if you let your engine play for your whole life there is virtually no chance of a collision ever occurring.
Edit: I just read that you are trying to implement this for a variant of chess that has off-board pieces. Zobrist hashing is still the right solution for you. You will have to find a way to incorporate this information in the hash. You could, for example, have some keys for the off-the-board pieces:
unsigned long long offTheBoardZobKeys[NUMBER_OF_PIECES][MAXIMUM_NUMBER_OF_ONE_PIECE_TYPE];
If you have 2 pawns off the board and put one of these pawns on A2, you will have to do 2 operations:
// remove one pawn from the off-the-board set
boardHash = boardHash ^ offTheBoardZobKeys[WHITE_PAWN][numberOfWhitePawnsOffTheBoard];
// Put a pawn on a2
boardHash = boardHash ^ zobKeys[WHITE_PAWN][A2];
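Pulling the above together for the poster's piece[32] representation, here is a hedged sketch: the names are made up, the slot-to-type mapping follows the layout from the question, and off-board pieces are keyed by their running count per type (the offTheBoardZobKeys idea), so that transposed piece lists hash to the same value:

#include <cstdint>
#include <random>

// Piece types 0..11: {pawn, knight, bishop, rook, queen, king} x {black, white}.
// Keys for on-board squares, plus separate keys for the n-th placeable and the
// n-th dead piece of each type (so off-board pieces hash by count, not identity).
static std::uint64_t squareKeys[12][64];
static std::uint64_t placeableKeys[12][10];
static std::uint64_t deadKeys[12][10];

void initZobrist() {
    std::mt19937_64 rng(0xC0FFEEULL);               // 64-bit Mersenne Twister
    for (auto& row : squareKeys)    for (auto& key : row) key = rng();
    for (auto& row : placeableKeys) for (auto& key : row) key = rng();
    for (auto& row : deadKeys)      for (auto& key : row) key = rng();
}

// Map a slot in the piece[32] layout to a piece type 0..11.
static int typeOfSlot(int slot) {
    int color = slot / 16, s = slot % 16;           // 0 = black, 1 = white
    int kind = s < 8 ? 0 : s < 10 ? 1 : s < 12 ? 2 : s < 14 ? 3 : s == 14 ? 4 : 5;
    return color * 6 + kind;
}

// Hash a position from scratch. Because identical pieces of a type share the
// same square keys, transposed piece lists hash to the same value. The same
// keys can be XORed in and out incrementally on each move, as described above.
std::uint64_t hashPosition(const char piece[32]) {
    std::uint64_t h = 0;
    int placeable[12] = {0}, dead[12] = {0};
    for (int i = 0; i < 32; ++i) {
        int t = typeOfSlot(i);
        if (piece[i] >= 0)        h ^= squareKeys[t][(int)piece[i]];
        else if (piece[i] == -1)  h ^= placeableKeys[t][placeable[t]++];
        else                      h ^= deadKeys[t][dead[t]++];
    }
    return h;
}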
Why not keep a 64-byte string in your database that corresponds to the chessboard layout? Every type of piece, including 'no piece', is represented by a letter (different case for the two colors, i.e. ABC for black, abc for white). Board comparison boils down to simple string comparison.
In general, comparing from the chessboard perspective, instead of the piece perspective, will get rid of your transpositions problem!
"do not suggest that I should be storing by board-centric instead of piece-centric".
You're so focused on not doing that that you miss the obvious solution: compare board-wise. To compare two position lists L1 and L2, place all elements of L1 on a (temporary) board. Then, for each element of L2, check whether it is present on the temporary board, and remove it if so. If an element of L2 is not present on the board (and thus not in L1), return unequal.
If, after removing all elements of L2, there are still pieces left on the board, then L1 must have had elements not present in L2 and the lists are unequal. L1 and L2 are equal only when the temporary board is empty afterwards.
An optimization is to check the lengths of L1 and L2 first. Not only will this catch many discrepancies quickly, it also eliminates the need to remove the elements of L2 from the board and the "empty board" check at the end. That is only needed to catch the case where L1 is a true superset of L2. If L1 and L2 have the same size, and L2 is a subset of L1, then L1 and L2 must be equal.
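A concrete sketch of this comparison against the piece[32] layout from the question (the helper names and the per-type counting are my own additions): list a is placed on a scratch board plus off-board counters, then list b is ticked off. Since both lists always contain 32 entries in the same fixed slot layout, the final empty-board check can indeed be skipped, as noted above.

#include <cstring>

static int typeOfSlot(int slot) {          // same fixed layout as in the question
    int color = slot / 16, s = slot % 16;
    int kind = s < 8 ? 0 : s < 10 ? 1 : s < 12 ? 2 : s < 14 ? 3 : s == 14 ? 4 : 5;
    return color * 6 + kind;
}

// Drop-in, transposition-tolerant replacement for the linear comparison.
bool samePosition(const char a[32], const char b[32]) {
    int board[12][64];                      // per type: which squares are occupied
    int placeable[12] = {0}, dead[12] = {0};
    std::memset(board, 0, sizeof board);

    for (int i = 0; i < 32; ++i) {          // "place" list a on the temporary board
        int t = typeOfSlot(i);
        if (a[i] >= 0)       ++board[t][(int)a[i]];
        else if (a[i] == -1) ++placeable[t];
        else                 ++dead[t];
    }
    for (int i = 0; i < 32; ++i) {          // tick off list b; any miss means unequal
        int t = typeOfSlot(i);
        if (b[i] >= 0)        { if (board[t][(int)b[i]]-- <= 0) return false; }
        else if (b[i] == -1)  { if (placeable[t]-- <= 0)        return false; }
        else                  { if (dead[t]-- <= 0)             return false; }
    }
    return true;                            // equal sizes, so no leftover check needed
}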
Your main objection to storing the states board-wise is that you have a bag of position-less pieces. Why not maintain a board + a vector of pieces? This would meet your requirements and it has the advantage that it is a canonical representation for your states. Hence you don't need the sorting, and you either use this representation internally or convert to it when you need to compare:
Piece-type in A1
... 63 more squares
Number of white pawns off-board
Number of black pawns off-board
... other piece types
From the piece perspective you could do this:
for each color:
    for each piece type:
        start new list for board A
        for each piece of this piece type on board A
            add piece position to the list
        start new list for board B
        for each piece of this piece type on board B
            add piece position to the list
        order both lists and compare them
Optimizations can come in different ways. Your advantage is: as soon as you notice a difference, you're done!
You could for instance start with a quick and dirty check by summing up all the indexes for all pieces, for both boards. The sums should be equal. If not, there's a difference.
If the sums are equal, you could quickly compare the positions of the unique pieces (King and Queen). Then you could write out (in somewhat complicated if statements) the comparisons for the pieces that are in pairs. All you then have to do is compare the pawns using the above stated method.
And a third option (I really do hope posting 3 answers to one question is ok, stackoverflow-wise ;)):
Always keep your pieces of the same type in index order, i.e. the first pawn in the list should always have the lowest index. If a move takes place that breaks this, just swap the pawns' positions in the list. The user won't see the difference; a pawn is a pawn.
Now when comparing positions, you can be sure there's no transpositions problem and you can just use your proposed for-loop.
Given your choice of game state representation, you have to sort the black pawns' indices, the white pawns' indices, etc., one way or the other. If you don't do it in the course of creating a new game state, you will have to do it upon comparison. Because you only need to sort a maximum of 8 elements, this can be done quite fast.
There are a few alternatives to represent your game states:
Represent each type of piece as a bit field. The first 64 bits mean that there is a piece of this type on that board coordinate; then there are n bits of "placeable" and n bits of "dead" slots, which have to be filled from one side (n is the number of pieces of this type).
or
Give each type of piece a unique ID, e.g. white pawns could be 0x01. A game state consists of an array of 64 pieces (the board) and two ordered lists of "placeable" and "dead" pieces. Maintaining the order of these lists can be done quite efficiently upon inserting and deleting.
These two alternatives would not have a transposition problem.
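A sketch of the first alternative; the counters stand in for the "filled from one side" placeable/dead bit runs, since only the counts matter for equality, and the type indices and C++20 defaulted comparisons are my own choices:

#include <cstdint>

// One bitboard per piece type: bit s set means a piece of this type sits on
// square s. The defaulted comparisons (C++20) give memberwise equality, which
// is transposition-free by construction.
struct TypeState {
    std::uint64_t onBoard   = 0;    // squares 0-63
    std::uint8_t  placeable = 0;
    std::uint8_t  dead      = 0;
    bool operator==(const TypeState&) const = default;
};

struct GameState {
    TypeState types[12];            // {pawn, knight, bishop, rook, queen, king} x {black, white}
    bool operator==(const GameState&) const = default;
};

// Example: the first white rook appears on A1 (square 0), or is captured instead.
void demo(GameState& g) {
    const int WHITE_ROOK = 9;                  // assumed type index
    g.types[WHITE_ROOK].onBoard |= 1ULL << 0;  // rook on A1
    // ++g.types[WHITE_ROOK].dead;             // ...or count it as captured
}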
Anyway, I have the impression that you are fiddling around with micro-optimizations when you should first get it to work.
