Sorting large text file in python - algorithm

Sort the content of a file based on second field, e.g.
Input file:
Jervie,12,M
Jaimy,11,F
Tony,23,M
Janey,11,F
Output file:
Jaimy,11,F
Janey,11,F
Jervie,12,M
Tony,23,M
We need to use an external sort.
The input file can be as large as 4 GB, and only 1 GB of RAM is available.
I tried the code below, but it does not work because it treats every line as an int. I also have a doubt about the buffer size used in each pass of the external sort: how should I choose it?
This code sorts a file of integers only.
import heapq
import tempfile
from itertools import islice

file = open("i2.txt", "r")
temp_files = []
e = []
while True:
    temp_file = tempfile.TemporaryFile(mode="w+")
    e = list(islice(file, 2))  # read the next chunk of 2 lines
    if not e:
        break
    e.sort(key=lambda line: int(line.split()[0]))
    temp_file.writelines(e)
    temp_files.append(temp_file)
    temp_file.flush()
    temp_file.seek(0)
file.close()
with open('o.txt', 'w') as out:
    # merge the sorted chunks; every line is parsed as a single int
    out.writelines(map('{}\n'.format, heapq.merge(*(map(int, f) for f in temp_files))))
I am able to create temporary files sorted on the second field, but how do I merge them based on that?

I did it with the following code.
Divide the big file into smaller files. Here it is assumed that at most 4 lines can be held in memory at once, so I first split the file into chunks of 4 lines, sort each chunk, and write it to a temp file. Then I read these files in pairs, 2 lines at a time from each file, and merge them. Not every corner case is handled, but this should be a starting point for others.
import tempfile
from itertools import islice


def key(line):
    # sort on the second comma-separated field, as an integer
    return int(line.split(',')[1])


f = open("i1.txt", "r")
temp_files = []
while True:
    temp_file = tempfile.NamedTemporaryFile(mode="w+")
    e = list(islice(f, 4))  # next chunk of 4 lines
    if not e:
        temp_file.close()
        break
    e.sort(key=key)
    temp_file.writelines(e)
    temp_files.append(temp_file)
    temp_file.flush()
    temp_file.seek(0)
f.close()
aux = []
z = 0
while len(temp_files) != 1:
    # merge the temp files pairwise into a new list of temp files
    while z < len(temp_files) - 1:
        tem = tempfile.NamedTemporaryFile(mode="w+")
        t1 = temp_files[z]
        t2 = temp_files[z + 1]
        t1.seek(0)
        t2.seek(0)
        e1 = None
        e2 = None
        while True:
            # refill whichever 2-line buffer has been used up
            if not e1:
                e1 = list(islice(t1, 2))
            if not e2:
                e2 = list(islice(t2, 2))
            if not e1 and not e2:
                break
            elif e1 and not e2:
                tem.writelines(e1)
                e1 = None
                continue
            elif not e1 and e2:
                tem.writelines(e2)
                e2 = None
                continue
            i = 0
            j = 0
            while i < len(e1) and j < len(e2):
                l1 = e1[i]
                l2 = e2[j]
                if key(l1) == key(l2):
                    tem.writelines([l1, l2])
                    i += 1
                    j += 1
                elif key(l1) < key(l2):
                    tem.writelines([l1])
                    i += 1
                else:
                    tem.writelines([l2])
                    j += 1
            if i >= len(e1):
                e1 = None
            else:
                e1 = e1[i:]
            if j >= len(e2):
                e2 = None
            else:
                e2 = e2[j:]
        z += 2
        aux.append(tem)
        t1.close()
        t2.close()
        tem.flush()
        tem.seek(0)
    if z < len(temp_files):
        # odd number of files: carry the unpaired one over to the next pass
        aux.append(temp_files[z])
    temp_files = aux
    z = 0
    aux = []
with open("o.txt", 'w') as out:
    out.writelines(temp_files[0])
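A shorter variant of the same idea, as a minimal sketch rather than a tuned implementation: sort fixed-size chunks into temp files, then let heapq.merge do a k-way merge with the same key function. It assumes Python 3.5 or later (for the key argument of heapq.merge) and comma-separated lines with an integer second field. The names external_sort, sort_key and chunk_lines are made up for this illustration.

```python
import heapq
import tempfile
from itertools import islice


def sort_key(line):
    # Second comma-separated field, compared as an integer.
    return int(line.split(",")[1])


def external_sort(in_path, out_path, chunk_lines=100000):
    """Sort in_path by its second field while holding one chunk in memory."""
    chunk_files = []
    with open(in_path) as src:
        while True:
            chunk = list(islice(src, chunk_lines))
            if not chunk:
                break
            # Make sure the last line of the file ends with a newline.
            if not chunk[-1].endswith("\n"):
                chunk[-1] += "\n"
            chunk.sort(key=sort_key)
            tmp = tempfile.TemporaryFile(mode="w+")
            tmp.writelines(chunk)
            tmp.seek(0)
            chunk_files.append(tmp)
    with open(out_path, "w") as out:
        # heapq.merge pulls one line at a time from each sorted chunk.
        out.writelines(heapq.merge(*chunk_files, key=sort_key))
    for tmp in chunk_files:
        tmp.close()
```

On the buffer-size question: chunk_lines only has to be small enough that one chunk, held as a list of Python strings, fits comfortably in RAM. Larger chunks mean fewer temp files to keep open during the merge, so the value is a trade-off to measure rather than a fixed rule.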

Try using out-of-core processing with Blaze (http://blaze.readthedocs.io/en/latest/ooc.html).

Related

get pairs / triple / quadruple... of elements from vector by function

I have a vector with a couple of elements and I want to write a function that returns me all combinations of x items from this vector.
The following code produces the right output for the case x=2 or x=3 or x=4.
However, I can not implement a solution for every possible x following this idea.
values = {'A','B','C','D','E'};
n = length(values);
data2 = {}; % case x=2
for i = 1:n
    for j = i+1:n
        data2{end+1} = {values{i}, values{j}};
        fprintf('%s %s\n',values{i}, values{j})
    end
end
data3 = {}; % case x=3
for i = 1:n
    for j = i+1:n
        for k = j+1:n
            data3{end+1} = {values{i}, values{j}, values{k}};
            fprintf('%s %s %s\n',values{i}, values{j}, values{k})
        end
    end
end
data4 = {}; % case x=4
for i = 1:n
    for j = i+1:n
        for k = j+1:n
            for l = k+1:n
                data4{end+1} = {values{i}, values{j}, values{k}, values{l}};
                fprintf('%s %s %s %s\n',values{i}, values{j}, values{k}, values{l})
            end
        end
    end
end
What would a function look like that returns my data variable for any x?
data = getCombinations(values, x) %values is vector with elements, x is integer value
EDIT
The following code comes pretty close:
data = perms(values)
data = data(:,1:x)
data = unique(data,'rows')
but it still produces output like A,B and B,A
EDIT2
This fixed it somehow but it is not very nice to look at and it does not work for text entries in cells but only for numbers
data = perms(values)
data = data(:,1:x)
data = sort(data,2)
data = unique(data,'rows')
EDIT3
This did it but it is not very nice to look at... Maybe there is a better solution?
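function [data] = getCombinations(values,x)
    i = 1:length(values);
    d = perms(i);        % all orderings of the indices
    d = d(:,1:x);        % keep the first x columns
    d = sort(d,2);       % so that A,B and B,A become identical rows
    d = unique(d,'rows');
    data = values(d);    % map the indices back to the values
end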
If you don't want repetitions (and your example suggests you don't) then try nchoosek as nchoosek(1:n, x) to give indices:
values = {'A','B','C','D','E'};
n = length(values);
x = 3;
C = nchoosek(1:n, x);
data = values(C)
In the above, each row is a unique combination of 3 of the 5 elements of values.
Alternatively pass in the values directly:
data = nchoosek(values, x);
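For readers curious how a general getCombinations can work without one nested loop per element, here is a small recursive sketch. It is written in Python rather than MATLAB, purely as an illustration of the idea behind the nested loops above; the name get_combinations is made up for this example. (Python's standard library offers the same result through itertools.combinations.)

```python
def get_combinations(values, x):
    """Return all x-element combinations of values, in index order."""
    if x == 0:
        return [[]]
    result = []
    for i, first in enumerate(values):
        # Pick values[i] first, then choose the rest from later positions only,
        # which is what the j = i+1:n, k = j+1:n loops do by hand.
        for rest in get_combinations(values[i + 1:], x - 1):
            result.append([first] + rest)
    return result


pairs = get_combinations(["A", "B", "C", "D", "E"], 2)
print(len(pairs))  # 10 pairs, starting with ['A', 'B']
```

Restricting each recursive call to later positions is what prevents both A,B and B,A from appearing.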

Is there a way to make this code faster and if possible avoid loops?

A1, B1, C1, A2, B2 and C2 are 6 matrices with the same dimensions, 4435x2000.
I have to find the values i, j and k for which A1(k,2000) == A2(i,j) and B1(k,2000) == B2(i,j) and C1(k,2000) == C2(i,j) , with the condition X(k)==1 and Y(i,j)==1
The objective is to find: counter, L, T and D
Is there a way to make this code faster? Can I avoid loops?
counter=0;
L(1)=0;
T(1)=0;
D(1)=0;
for k=1:4435
    if X(k)==1 % X is a vector (4435x1)
        F(k,:) = [A1(k,2000) B1(k,2000) C1(k,2000)]
        for i=1:4435
            for j=100:1999
                if Y(i,j)==1 % Y is a matrix (4435x1999)
                    if F(k,:) == [A2(i,j) B2(i,j) C2(i,j)]
                        counter = counter+1;
                        L(counter)=k;
                        T(counter)=i;
                        D(counter)=j;
                    end
                end
            end
        end
    end
end
I want a solution that saves at least 80% of the computation time
and does not fail with the error message: Out of memory
See how this works out for you -
%// Store X-Y data by calling X() and Y() functions
X_data = X(1:4435);
Y_data = Y(1:4435,100:1999);
range1 = 100:1999 %// define range for columns
A2 = A2(:,range1); %// Crop out A2, B2, C2 based on column-range
B2 = B2(:,range1);
C2 = C2(:,range1);
Y_data = Y_data==1; %// Y_data was already cropped to range1 when it was created
%// Indices for dim-3
idx_X = find(X_data==1)
%// Map X==1 onto A1, B1, C1
A1Lr = A1(X_data==1,end)
B1Lr = B1(X_data==1,end)
C1Lr = C1(X_data==1,end)
%// Setup output array to store L, T, D as single Nx3 output array
out = zeros(sum(Y_data(:))*numel(A1Lr),3);
%// Try out(sum(Y_data(:)==1)*numel(A1Lr),3)=0; instead for speed!
%// Start collecting output indices
count = 1;
for iter1 = 1:numel(A1Lr)
[R,C] = find(Y_data & A2==A1Lr(iter1) & B2==B1Lr(iter1) & C2==C1Lr(iter1));
nR = numel(R);
out(count:count+nR-1,:) = [R C repmat(iter1,nR,1)];
count = count + nR;
end
out(find(out(:,1)==0,1):end,:)=[];
%// Packup the outputs
T = out(:,1)
D = out(:,2) + range1(1)-1
L = idx_X(out(:,3))
It is very difficult to determine what your code is actually supposed to accomplish, without really working to interpret your code. However, I'll give it a crack:
% Determine where X is true.
XTrue = X == 1;
% Extract values from A1,B1,C1 where X is true.
F ( XTrue , 1 : 3 ) = [ A1(XTrue,2000) B1(XTrue,2000) C1(XTrue,2000) ];
% Determine where Y is true.
YTrueIndex = find ( Y == 1 );
% Determine where the extracted values match
counter = 0;
L = [];
T = [];
D = [];
for ( ii = 1 : length(YTrueIndex) )
    indexCurrent = YTrueIndex(ii);
    FRowsThatMatch = F(:,1)==A2(indexCurrent) & F(:,2)==B2(indexCurrent) & F(:,3)==C2(indexCurrent);
    matchCount = length ( find ( FRowsThatMatch ) );
    if ( matchCount > 0 )
        counter = counter + matchCount;
        [ i , j ] = ind2sub ( size ( Y ) , indexCurrent );
        % Append as column vectors so that L, T and D stay the same length.
        L = [ L ; find ( FRowsThatMatch ) ];
        T = [ T ; ones(matchCount,1)*i ];
        D = [ D ; ones(matchCount,1)*j ];
    end
end

how to match dna sequence pattern

I am having trouble finding an approach to solve this problem.
Input-output sequences are as follows :
**input1 :** aaagctgctagag
**output1 :** a3gct2ag2
**input2 :** aaaaaaagctaagctaag
**output2 :** a6agcta2ag
The input sequence can be up to 10^6 characters long, and the largest contiguous repeated patterns are to be used.
For example for input2 "agctaagcta" output will not be "agcta2gcta" but it will be "agcta2".
Any help appreciated.
Explanation of the algorithm:
Having a sequence S with symbols s(1), s(2),…, s(N).
Let B(i) be the best compressed sequence with elements s(1), s(2),…,s(i).
So, for example, B(3) will be the best compressed sequence for s(1), s(2), s(3).
What we want to know is B(N).
To find it, we will proceed by induction. We want to calculate B(i+1), knowing B(i), B(i-1), B(i-2), …, B(1), B(0), where B(0) is the empty sequence and B(1) = s(1). At the same time, this constitutes a proof that the solution is optimal. ;)
To calculate B(i+1), we will pick the best sequence among the candidates:
Candidate sequences where the last block has one element:
B(i )s(i+1)1
B(i-1)s(i+1)2 ; only if s(i) = s(i+1)
B(i-2)s(i+1)3 ; only if s(i-1) = s(i) and s(i) = s(i+1)
…
B(1)s(i+1)[i] ; only if s(2)=s(3) and s(3)=s(4) and … and s(i) = s(i+1)
B(0)s(i+1)[i+1] = s(i+1)[i+1] ; only if s(1)=s(2) and s(2)=s(3) and … and s(i) = s(i+1)
Candidate sequences where the last block has 2 elements:
B(i-1)s(i)s(i+1)1
B(i-3)s(i)s(i+1)2 ; only if s(i-2)s(i-1)=s(i)s(i+1)
B(i-5)s(i)s(i+1)3 ; only if s(i-4)s(i-3)=s(i-2)s(i-1) and s(i-2)s(i-1)=s(i)s(i+1)
…
Candidate sequences where the last block has 3 elements:
…
Candidate sequences where the last block has 4 elements:
…
…
Candidate sequences where the last block has i+1 elements:
s(1)s(2)s(3)………s(i+1)1
For each possibility, the algorithm stops when the sequence block is no longer repeated. And that’s it.
The algorithm will be something like this in pseudo-C code:
B(0) = ""
for (i=1; i<=N; i++) {
    // Calculate all the candidates for B(i)
    BestCandidate = null
    for (j=1; j<=i; j++) {
        // Calculate all the candidates whose last block has j elements
        r = 1;
        do {
            Candidate = B(i-j*r) s(i-j+1)…s(i-1)s(i) r
            if ( (BestCandidate==null)
                 || (Candidate is shorter than BestCandidate))
            {
                BestCandidate = Candidate
            }
            r++;
        } while ( (j*r <= i)
                  && (s(i-j*r+1) s(i-j*r+2)…s(i-j*r+j) == s(i-j+1) s(i-j+2)…s(i-j+j)) )
    }
    B(i) = BestCandidate
}
Hope that this can help a little more.
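To make the induction concrete, here is a short Python sketch of the same recurrence. It is an illustration under assumptions, not the answer's own code: the name dna_compress is made up, every block is written with its count (including a trailing 1), and each B(i) is stored as a full string, which is only practical for short inputs.

```python
def dna_compress(s):
    """Shortest block-and-count encoding of s, following the B(i) recurrence."""
    n = len(s)
    # best[i] = (length, text) of the shortest encoding of the first i symbols
    best = [(0, "")] + [None] * n
    for i in range(1, n + 1):
        for j in range(1, i + 1):      # j = size of the last block
            block = s[i - j:i]
            r = 1                      # r = how many times the block repeats
            # Extend r while the r-th block from the end still equals block.
            while j * r <= i and s[i - j * r:i - j * r + j] == block:
                prev_len, prev_text = best[i - j * r]
                cand_len = prev_len + j + len(str(r))
                if best[i] is None or cand_len < best[i][0]:
                    best[i] = (cand_len, prev_text + block + str(r))
                r += 1
    return best[n][1]


print(dna_compress("aaagctgctagag"))  # a3gct2ag2
```

Keeping only the length and a back-pointer per position, instead of the whole string, is what makes the approach usable on longer sequences.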
The full C program performing the required task is given below. It runs in O(n^2). The central part is only 30 lines of code.
EDIT: I have restructured the code a little, renamed the variables and added some comments to make it more readable.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
// This struct represents a compressed segment like atg4, g3, agc1
struct Segment {
char *elements;
int nElements;
int count;
};
// As an example, for the segment agagt3 elements would be:
// {
// elements: "agagt",
// nElements: 5,
// count: 3
// }
struct Sequence {
struct Segment lastSegment;
struct Sequence *prev; // Points to a sequence without the last segment or NULL if it is the first segment
int totalLen; // Total length of the compressed sequence.
};
// as an example, for the sequence agt32ta5, the representation will be:
// {
// lastSegment:{"ta" , 2 , 5},
// prev: #A,
// totalLen: 8
// }
// and A will be
// {
// lastSegment{ "agt", 3, 32},
// prev: NULL,
// totalLen: 5
// }
// This function converts a sequence to a string.
// You have to free the string after using it.
// The strategy is to construct the string from right to left.
char *sequence2string(struct Sequence *S) {
char *Res=malloc(S->totalLen + 1);
char *digits="0123456789";
int p= S->totalLen;
Res[p]=0;
while (S!=NULL) {
// first we insert the count of the last element.
// We do digit by digit starting with the units.
int C = S->lastSegment.count;
while (C) {
p--;
Res[p] = digits[ C % 10 ];
C /= 10;
}
p -= S->lastSegment.nElements;
strncpy(Res + p , S->lastSegment.elements, S->lastSegment.nElements);
S = S ->prev;
}
return Res;
}
// Compresses a dna sequence.
// Returns a string with the in sequence compressed.
// The returned string must be freed after using it.
char *dnaCompress(char *in) {
int i,j;
int N = strlen(in); // Number of elements of the in sequence.
// B is an array of N+1 sequences where B[i] is the best compressed sequence of the first i characters.
// What we want to return is B[N];
struct Sequence *B;
B = malloc((N+1) * sizeof (struct Sequence));
// We first do an initialization for i=0
B[0].lastSegment.elements="";
B[0].lastSegment.nElements=0;
B[0].lastSegment.count=0;
B[0].prev = NULL;
B[0].totalLen=0;
// and set totalLen of all the other sequences to a very high value (INT_MAX here; N*2 would be enough). We will try different candidates and keep the shortest one.
for (i=1; i<=N; i++) B[i].totalLen = INT_MAX; // A very high value
for (i=1; i<=N; i++) {
// at this point we want to calculate B[i] and we know B[i-1], B[i-2], .... ,B[0]
for (j=1; j<=i; j++) {
// Here we will check all the candidates where the last segment has j elements
int r=1; // number of times the last segment is repeated
int rNDigits=1; // Number of digits of r
int rNDigitsBound=10; // We will increment r, so this value is when r will have an extra digit.
// when r = 0,1,...,9 => rNDigitsBound = 10
// when r = 10,11,...,99 => rNDigitsBound = 100
// when r = 100,101,.,999 => rNDigitsBound = 1000 and so on.
do {
// Here we analyze a candidate for B[i]
// where the last segment has j elements repeated r times.
int CandidateLen = B[i-j*r].totalLen + j + rNDigits;
if (CandidateLen < B[i].totalLen) {
B[i].lastSegment.elements = in + i - j*r;
B[i].lastSegment.nElements = j;
B[i].lastSegment.count = r;
B[i].prev = &(B[i-j*r]);
B[i].totalLen = CandidateLen;
}
r++;
if (r == rNDigitsBound ) {
rNDigits++;
rNDigitsBound *= 10;
}
} while ( (i - j*r >= 0)
&& (strncmp(in + i -j, in + i - j*r, j)==0));
}
}
char *Res=sequence2string(&(B[N]));
free(B);
return Res;
}
int main(int argc, char** argv) {
char *compressedDNA=dnaCompress(argv[1]);
puts(compressedDNA);
free(compressedDNA);
return 0;
}
Forget Ukkonen. Dynamic programming it is, with a 3-dimensional table indexed by:
sequence position
subsequence size
number of segments
TERMINOLOGY: For example, with a = "aaagctgctagag", the sequence position coordinate runs from 1 to 13. At sequence position 4 (letter 'g') with subsequence size 4, the subsequence is "gctg". As for the number of segments: expressing a as "aaagctgctagag1" uses 1 segment (the sequence itself), "a3gct2ag2" uses 3 segments, "aaagctgct1ag2" uses 2 segments, and "a2a1ctg2ag2" would use 4 segments.
With this, you start filling a 3-dimensional 13 x 13 x 13 array, so your time and memory complexity seems to be around n ** 3. Are you sure you can handle that for million-bp sequences? I think a greedy approach would be better, because large DNA sequences are unlikely to repeat exactly. I would also suggest widening your assignment to approximate matches; then you could publish it straight in a journal.
Anyway, you will start filling the table of compressing a subsequence starting at some position (dimension 1) with length equal to dimension 2 coordinate, having at most dimension 3 segments. So you first fill the first row, representing compressions of subsequences of length 1 consisting of at most 1 segment:
a a a g c t g c t a g a g
1(a1) 1(a1) 1(a1) 1(g1) 1(c1) 1(t1) 1(g1) 1(c1) 1(t1) 1(a1) 1(g1) 1(a1) 1(g1)
The number is the character cost (always 1 for these trivial 1-char sequences; the number 1 does not count toward the character cost), and in parentheses you have the compression (also trivial for this simple case). The second row is still simple:
2(a2) 2(a2) 2(ag1) 2(gc1) 2(ct1) 2(tg1) 2(gc1) 2(ct1) 2(ta1) 2(ag1) 2(ga1) 2(ag1)
There is only 1 way to decompose a 2-character sequence into 2 subsequences -- 1 character + 1 character. If they are identical, the result is like a + a = a2. If they are different, such as a + g, then, because only 1-segment sequences are admissible, the result cannot be a1g1, but must be ag1. The third row will be finally more interesting:
2(a3) 3(aag1) 3(agc1) 3(gct1) 3(ctg1) 3(tgc1) 3(gct1) 3(cta1) 3(tag1) 3(aga1) 3(gag1)
Here, you can always choose between 2 ways of composing the compressed string. For example, aag can be composed either as aa + g or a + ag. But again, we cannot have 2 segments, as in aa1g1 or a1ag1, so we must be satisfied with aag1, unless both components consist of the same character, as in aa + a => a3, with character cost 2. We can continue onto the 4th line:
4(aaag1) 4(aagc1) 4(agct1) 4(gctg1) 4(ctgc1) 4(tgct1) 4(gcta1) 4(ctag1) 4(taga1) 3(ag2)
Here, at the first position, we cannot use a3g1, because only 1 segment is allowed at this layer. But at the last position, compression to character cost 3 is achieved by ag1 + ag1 = ag2. This way, one can fill the whole first-level table all the way up to the single subsequence of 13 characters, and each subsequence will have its optimal character cost and its compression under the first-level constraint of at most 1 segment associated with it.
Then you go to the 2nd level, where 2 segments are allowed... And again, from the bottom up, you identify the optimum cost and compression of each table coordinate under the given level's segment count constraint, by comparing all the possible ways to compose the subsequence using already computed positions, until you fill the table completely and thus compute the global optimum. There are some details to solve, but sorry, I'm not gonna code this for you.
After trying my own way for a while, my kudos to jbaylina for his beautiful algorithm and C implementation. Here's my attempted version of jbaylina's algorithm in Haskell, and below it further development of my attempt at a linear-time algorithm that attempts to compress segments that include repeated patterns in a one-by-one fashion:
import Data.Map (fromList, insert, size, (!))
compress s = (foldl f (fromList [(0,([],0)),(1,([s!!0],1))]) [1..n - 1]) ! n
where
n = length s
f b i = insert (size b) bestCandidate b where
add (sequence, sLength) (sequence', sLength') =
(sequence ++ sequence', sLength + sLength')
j' = [1..min 100 i]
bestCandidate = foldr combCandidates (b!i `add` ([s!!i,'1'],2)) j'
combCandidates j candidate' =
let nextCandidate' = comb 2 (b!(i - j + 1)
`add` ((take j . drop (i - j + 1) $ s) ++ "1", j + 1))
in if snd nextCandidate' <= snd candidate'
then nextCandidate'
else candidate' where
comb r candidate
| r > uBound = candidate
| not (strcmp r True) = candidate
| snd nextCandidate <= snd candidate = comb (r + 1) nextCandidate
| otherwise = comb (r + 1) candidate
where
uBound = div (i + 1) j
prev = b!(i - r * j + 1)
nextCandidate = prev `add`
((take j . drop (i - j + 1) $ s) ++ show r, j + length (show r))
strcmp 1 _ = True
strcmp num bool
| (take j . drop (i - num * j + 1) $ s)
== (take j . drop (i - (num - 1) * j + 1) $ s) =
strcmp (num - 1) True
| otherwise = False
Output:
*Main> compress "aaagctgctagag"
("a3gct2ag2",9)
*Main> compress "aaabbbaaabbbaaabbbaaabbb"
("aaabbb4",7)
Linear-time attempt:
import Data.List (sortBy)
group' xxs sAccum (chr, count)
| null xxs = if null chr
then singles
else if count <= 2
then reverse sAccum ++ multiples ++ "1"
else singles ++ if null chr then [] else chr ++ show count
| [x] == chr = group' xs sAccum (chr,count + 1)
| otherwise = if null chr
then group' xs (sAccum) ([x],1)
else if count <= 2
then group' xs (multiples ++ sAccum) ([x],1)
else singles
++ chr ++ show count ++ group' xs [] ([x],1)
where x:xs = xxs
singles = reverse sAccum ++ (if null sAccum then [] else "1")
multiples = concat (replicate count chr)
sequences ws strIndex maxSeqLen = repeated' where
half = if null . drop (2 * maxSeqLen - 1) $ ws
then div (length ws) 2 else maxSeqLen
repeated' = let (sequence,(sequenceStart, sequenceEnd'),notSinglesFlag) = repeated
in (sequence,(sequenceStart, sequenceEnd'))
repeated = foldr divide ([],(strIndex,strIndex),False) [1..half]
equalChunksOf t a = takeWhile(==t) . map (take a) . iterate (drop a)
divide chunkSize b@(sequence,(sequenceStart, sequenceEnd'),notSinglesFlag) =
let t = take (2*chunkSize) ws
t' = take chunkSize t
in if t' == drop chunkSize t
then let ts = equalChunksOf t' chunkSize ws
lenTs = length ts
sequenceEnd = strIndex + lenTs * chunkSize
newEnd = if sequenceEnd > sequenceEnd'
then sequenceEnd else sequenceEnd'
in if chunkSize > 1
then if length (group' (concat (replicate lenTs t')) [] ([],0)) > length (t' ++ show lenTs)
then (((strIndex,sequenceEnd,chunkSize,lenTs),t'):sequence, (sequenceStart,newEnd),True)
else b
else if notSinglesFlag
then b
else (((strIndex,sequenceEnd,chunkSize,lenTs),t'):sequence, (sequenceStart,newEnd),False)
else b
addOne a b
| null (fst b) = a
| null (fst a) = b
| otherwise =
let (((start,end,patLen,lenS),sequence):rest,(sStart,sEnd)) = a
(((start',end',patLen',lenS'),sequence'):rest',(sStart',sEnd')) = b
in if sStart' < sEnd && sEnd < sEnd'
then let c = ((start,end,patLen,lenS),sequence):rest
d = ((start',end',patLen',lenS'),sequence'):rest'
in (c ++ d, (sStart, sEnd'))
else a
segment xs baseIndex maxSeqLen = segment' xs baseIndex baseIndex where
segment' zzs@(z:zs) strIndex farthest
| null zs = initial
| strIndex >= farthest && strIndex > 0 = ([],(0,0))
| otherwise = addOne initial next
where
next@(s',(start',end')) = segment' zs (strIndex + 1) farthest'
farthest' | null s = farthest
| otherwise = if start /= end && end > farthest then end else farthest
initial@(s,(start,end)) = sequences zzs strIndex maxSeqLen
areExclusive ((a,b,_,_),_) ((a',b',_,_),_) = (a' >= b) || (b' <= a)
combs [] r = [r]
combs (x:xs) r
| null r = combs xs (x:r) ++ if null xs then [] else combs xs r
| otherwise = if areExclusive (head r) x
then combs xs (x:r) ++ combs xs r
else if l' > lowerBound
then combs xs (x: reduced : drop 1 r) ++ combs xs r
else combs xs r
where lowerBound = l + 2 * patLen
((l,u,patLen,lenS),s) = head r
((l',u',patLen',lenS'),s') = x
reduce = takeWhile (>=l') . iterate (\x -> x - patLen) $ u
lenReduced = length reduce
reduced = ((l,u - lenReduced * patLen,patLen,lenS - lenReduced),s)
buildString origStr sequences = buildString' origStr sequences 0 (0,"",0)
where
buildString' origStr sequences index accum@(lenC,cStr,lenOrig)
| null sequences = accum
| l /= index =
buildString' (drop l' origStr) sequences l (lenC + l' + 1, cStr ++ take l' origStr ++ "1", lenOrig + l')
| otherwise =
buildString' (drop u' origStr) rest u (lenC + length s', cStr ++ s', lenOrig + u')
where
l' = l - index
u' = u - l
s' = s ++ show lenS
(((l,u,patLen,lenS),s):rest) = sequences
compress [] _ accum = reverse accum ++ (if null accum then [] else "1")
compress zzs@(z:zs) maxSeqLen accum
| null (fst segment') = compress zs maxSeqLen (z:accum)
| (start,end) == (0,2) && not (null accum) = compress zs maxSeqLen (z:accum)
| otherwise =
reverse accum ++ (if null accum || takeWhile' compressedStr 0 /= 0 then [] else "1")
++ compressedStr
++ compress (drop lengthOriginal zzs) maxSeqLen []
where segment'@(s,(start,end)) = segment zzs 0 maxSeqLen
combinations = combs (fst $ segment') []
takeWhile' xxs count
| null xxs = 0
| x == '1' && null (reads (take 1 xs)::[(Int,String)]) = count
| not (null (reads [x]::[(Int,String)])) = 0
| otherwise = takeWhile' xs (count + 1)
where x:xs = xxs
f (lenC,cStr,lenOrig) (lenC',cStr',lenOrig') =
let g = compare ((fromIntegral lenC + if not (null accum) && takeWhile' cStr 0 == 0 then 1 else 0) / fromIntegral lenOrig)
((fromIntegral lenC' + if not (null accum) && takeWhile' cStr' 0 == 0 then 1 else 0) / fromIntegral lenOrig')
in if g == EQ
then compare (takeWhile' cStr' 0) (takeWhile' cStr 0)
else g
(lenCompressed,compressedStr,lengthOriginal) =
head $ sortBy f (map (buildString (take end zzs)) (map reverse combinations))
Output:
*Main> compress "aaaaaaaaabbbbbbbbbaaaaaaaaabbbbbbbbb" 100 []
"a9b9a9b9"
*Main> compress "aaabbbaaabbbaaabbbaaabbb" 100 []
"aaabbb4"

Permutations with order restrictions

Let L be a list of objects. Moreover, let C be a set of constraints, e.g.:
C(1) = t1 comes before t2, where t1 and t2 belong to L
C(2) = t3 comes after t2, where t3 and t2 belong to L
How can I find (in MATLAB) the set of permutations for which the constraints in C are not violated?
My first solution is naive:
orderings = perms(L);
toBeDeleted = zeros(1,size(orderings,1));
for ii = 1:size(orderings,1)
    for jj = 1:size(constraints,1)
        idxA = find(orderings(ii,:) == constraints(jj,1));
        idxB = find(orderings(ii,:) == constraints(jj,2));
        if idxA > idxB
            toBeDeleted(ii) = 1;
        end
    end
end
where constraints is a set of constraints (each constraint is on a row of two elements, specifying that the first element comes before the second element).
I was wondering whether there exists a simpler (and more efficient) solution.
Thanks in advance.
I'd say that's a pretty good solution you have so far.
There are a few optimizations I see, though. Here's my variation:
% INITIALIZE
NN = 9;
L = rand(1,NN-1);
while numel(L) ~= NN;
L = unique( randi(100,1,NN) ); end
% Some bogus constraints
constraints = [...
L(1) L(2)
L(3) L(6)
L(3) L(5)
L(8) L(4)];
% METHOD 0 (your original method)
tic
orderings = perms(L);
p = size(orderings,1);
c = size(constraints,1);
toKeep = true(p,1);
for perm = 1:p
for constr = 1:c
idxA = find(orderings(perm,:) == constraints(constr,1));
idxB = find(orderings(perm,:) == constraints(constr,2));
if idxA > idxB
toKeep(perm) = false;
end
end
end
orderings0 = orderings(toKeep,:);
toc
% METHOD 1 (your original, plus a few optimizations)
tic
orderings = perms(L);
p = size(orderings,1);
c = size(constraints,1);
toKeep = true(p,1);
for perm = 1:p
for constr = 1:c
% break on first condition breached
if toKeep(perm)
% find only *first* entry
toKeep(perm) = ...
find(orderings(perm,:) == constraints(constr,1), 1) < ...
find(orderings(perm,:) == constraints(constr,2), 1);
else
break
end
end
end
orderings1 = orderings(toKeep,:);
toc
% METHOD 2
tic
orderings = perms(L);
p = size(orderings,1);
c = size(constraints,1);
toKeep = true(p,1);
for constr = 1:c
% break on first condition breached
if any(toKeep)
% Vectorized search for constraint values
[i1, j1] = find(orderings == constraints(constr,1));
[i2, j2] = find(orderings == constraints(constr,2));
% sort by rows
[i1, j1i] = sort(i1);
[i2, j2i] = sort(i2);
% Check if columns meet condition
toKeep = toKeep & j1(j1i) < j2(j2i);
else
break
end
end
orderings2 = orderings(toKeep,:);
toc
% Check for equality
all(orderings2(:) == orderings1(:))
Results:
Elapsed time is 17.911469 seconds. % your method
Elapsed time is 10.477549 seconds. % your method + optimizations
Elapsed time is 2.184242 seconds. % vectorized outer loop
ans =
1
ans =
1
The whole approach however has one fundamental flaw IMHO; the direct use of perms. This inherently poses a limitation due to memory constraints (NN < 10, as stated in help perms).
I have a strong suspicion you can get better performance, both time-wise and memory-wise, when you put together a customized perms. Luckily, perms is not built-in, so you can start by copy-pasting that code into your custom function.
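As a rough illustration of what such a customized perms could do, the sketch below builds orderings element by element and never extends a prefix that would violate a constraint, so the full list of permutations is never held in memory. It is written in Python rather than MATLAB, assumes the items are distinct and hashable, and the name constrained_permutations is made up for this example.

```python
def constrained_permutations(items, constraints):
    """Yield every ordering of items in which a comes before b
    for each (a, b) pair in constraints."""
    # must_precede[b] = set of items that have to be placed before b
    must_precede = {item: set() for item in items}
    for a, b in constraints:
        must_precede[b].add(a)

    def extend(prefix, placed, remaining):
        if not remaining:
            yield tuple(prefix)
            return
        for item in remaining:
            # Only place an item once all of its predecessors are placed.
            if must_precede[item] <= placed:
                yield from extend(prefix + [item],
                                  placed | {item},
                                  [r for r in remaining if r != item])

    yield from extend([], set(), list(items))


print(list(constrained_permutations([1, 2, 3], [(1, 2)])))
# [(1, 2, 3), (1, 3, 2), (3, 1, 2)]
```

Because the function is a generator, the valid orderings can be consumed one at a time; how much work pruning saves depends on how restrictive the constraints are.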

Strange Ruby String Selection

The string in question (read from a file):
if (true) then
{
_this = createVehicle ["Land_hut10", [6226.8901, 986.091, 4.5776367e-005], [], 0, "CAN_COLLIDE"];
_vehicle_10 = _this;
_this setDir -2.109278;
};
Retrieved from a large list of similar (all same file) strings via the following:
get_stringR(string,"if","};")
And the function code:
def get_stringR(a,b,c)
b = a.index(b)
b ||= 0
c = a.rindex(c)
c ||= b
r = a[b,c]
return r
end
So far, this works fine, but what I wanted to do is select the array after "createVehicle". The following (I thought) should work.
newstring = get_string(myString,"\[","\];")
Note that get_string is the same as get_stringR, except that it uses the first occurrence of the pattern both times, rather than the first and last occurrence.
The output should have been: ["Land_hut10", [6226.8901, 986.091, 4.5776367e-005], [], 0, "CAN_COLLIDE"];
Instead it was the below, given via 'puts':
["Land_hut10", [6226.8901, 986.091, 4.5776367e-005], [], 0, "CAN_COLLIDE"];
_vehicle_10 = _this;
_this setDir
That is some 40 characters past the point it should have retrieved, which was very strange...
Second note, using both get_string and get_stringR produced the exact same result with the parameters given.
I then decided to add the following to my get_string code:
b = a.index(b)
b ||= 0
c = a.index(c)
c ||= b
if c > 40 then
c -= 40
end
r = a[b,c]
return r
And it works as expected (for every 'block' in the file, even though the strings after that array are not identical in any way), but something obviously isn't right :).
You want r = a[b..c] instead of r = a[b,c].
Difference is: b..c = start from b, go to c, while b,c = start from b and move c characters to the right.
Edit: You don't have to/shouldn't escape the [ and ] either, because you are using strings and not regexen. Also, you have to take the length of the end ("];") into consideration, or you will cut off parts of the end.
def get_stringR(a,b,c)
  bp = a.index(b) || 0
  cp = a.rindex(c) || bp
  r = a[bp..cp + c.size - 1]
  return r
end
def get_string(a,b,c)
  bp = a.index(b) || 0
  cp = a.index(c) || bp
  r = a[bp..cp + c.size - 1]
  return r
end
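The position-versus-length distinction is not specific to Ruby. As a comparison only, here is the same extraction sketched in Python, where a slice always takes a start and a stop position. The helper name get_string mirrors the Ruby one but is otherwise hypothetical. Unlike the Ruby version, it searches for the end pattern only after the start pattern and returns None when either pattern is missing; those are choices made for this sketch.

```python
def get_string(text, start_pat, end_pat):
    """Return the substring from the first start_pat through the first
    end_pat that follows it (inclusive), or None if either is missing."""
    start = text.find(start_pat)
    if start == -1:
        return None
    end = text.find(end_pat, start)
    if end == -1:
        return None
    # The stop index is a position, not a length, and must include end_pat.
    return text[start:end + len(end_pat)]


line = '_this = createVehicle ["Land_hut10", [], 0, "CAN_COLLIDE"];\n_vehicle_10 = _this;'
print(get_string(line, "[", "];"))  # ["Land_hut10", [], 0, "CAN_COLLIDE"];
```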
