Why is this Perl script so slow?

I have written a Perl script to use in the polymake framework.
The script accomplishes the following task. Its input is a matrix $M from polymake. Issuing
print ref($M);
returns
Polymake::common::Matrix_A_Rational_I_NonSymmetric_Z
The output of the script is a matrix whose rows are all the sums of subsets of the rows of $M (the empty subset contributes the zero row). For example, the input
1 0 0
1 3 0
1 0 3
returns
0 0 0
3 3 3
2 3 3
2 0 3
1 0 3
2 3 0
1 3 0
1 0 0
The code for my script is
use application 'polytope';
use Data::PowerSet 'powerset';
sub newVertices {
    my ($orig) = @_;
    my $n = $orig->rows;
    my $vert = zero_vector($n);    # the empty subset contributes the zero row
    my $d = Data::PowerSet->new( {min => 1}, map {"$_"} (0..$n-1) );
    while (my $r = $d->next) {
        my $v = new Vector($n);
        foreach (@$r) {
            $v += $orig->[$_];     # add the selected row of the input matrix
        }
        $vert = $vert / $v;        # '/' appends $v as a new row
    }
    return $vert;
}
The script works fine. My issue is that it runs a lot slower than I think it should; the example above takes about half a minute. My questions are:
Am I correct to expect that such a task should be able to be accomplished quickly for matrices with relatively few rows?
Why is this script so slow?
Is there a way to accomplish this task more efficiently?
I suspect the problem might be the matrix object type I am dealing with, and that a script that does this purely with arrays might be more efficient. My problem is that I cannot figure out how to convert an array back into a matrix, so I haven't attempted anything along these lines.
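For scale: a plain-array version of the same computation, sketched here in Python purely to gauge the algorithmic cost (this is not polymake code, only an illustration), is instantaneous on an example of this size:

from itertools import combinations

def subset_row_sums(rows):
    # All sums of subsets of `rows`; the empty subset gives the zero row.
    n_cols = len(rows[0])
    result = [[0] * n_cols]
    for k in range(1, len(rows) + 1):
        for subset in combinations(rows, k):
            result.append([sum(col) for col in zip(*subset)])
    return result

print(subset_row_sums([[1, 0, 0], [1, 3, 0], [1, 0, 3]]))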


Subsetting Data with GREP

I have a very large text file (16GB) that I want to subset as fast as possible.
Here is a sample of the data involved
0 M 4 0
0 0 Q 0 10047345 3080290,4098689 50504886,4217515 9848058,1084315 50534229,4217515 50591618,4217515 26242582,2597528 34623075,3279130 68893581,5149883 50628761,4217517 32262001,3142702 35443881,3339757
0 108 C 0 50628761
0 1080 C 0 50628761
1 M 7 0
1 0 Q 0 17143989
2 M 15 1
2 0 Q 0 17143989 4219157,1841361,853923,1720163,1912374,1755325,4454730 65548702,4975721 197782,39086 54375043,4396765 31589696,3091097 6876504,851594 3374640,455375 13274885,1354902 31585771,3091016 61234218,4723345 31583582,3091014
2 27 C 0 31589696
The first number on every line is a sessionID, and any line with an 'M' denotes the start of a session (data is grouped by session). The number following the 'M' is a day, and the number after that is a userID; a user can have multiple sessions.
I want to extract all lines related to a specific user which for each session include all of the lines up until the next 'M' line is encountered (can be any number of lines). As a second task I also want to extract all session lines related to a specific day.
For example with the above data, to extract the records for userid '0' the output would be:
0 M 4 0
0 0 Q 0 10047345 3080290,4098689 50504886,4217515 9848058,1084315 50534229,4217515 50591618,4217515 26242582,2597528 34623075,3279130 68893581,5149883 50628761,4217517 32262001,3142702 35443881,3339757
0 108 C 0 50628761
0 1080 C 0 50628761
1 M 7 0
1 0 Q 0 17143989
To extract the records for day 7 the output would be:
1 M 7 0
1 0 Q 0 17143989
I believe there is a much more elegant and simple solution to what I have achieved so far and it would be great to get some feedback and suggestions. Thank you.
What I have tried
I tried to use pcregrep -M to apply this pattern directly (matching data between two 'M's) but struggled to get this working across the line breaks. I still suspect this may be the fastest option, so any guidance on whether this may be possible would be great.
The next part is quite scattered and it is not necessary to read on if you already have an idea for a better solution!
Failing the above, I split the problem into two parts:
Part 1: Isolating all 'M' lines to obtain a list of sessions belonging to that user/day
The grep method is fast (but then I need to figure out how to use this data):
time grep -c "M\t.*\t$user_id" trainSample.txt >> sessions.txt
The awk method to create an array is slow:
time myarr=$(awk '/M\t.*\t$user_id/ {print $1}' trainSample.txt)
Part 2: Extracting all lines belonging to a session on the list created in part 1
Continuing from the awk method, I ran grep for each but this is WAY too slow (days to complete 16GB)
for i in "${!myarr[@]}";
do
grep "^${myarr[$i]}\t" trainSample.txt >> sessions.txt
echo -ne "Session $i\r"
done
Instead of running grep once per session ID as above, putting them all in one grep command is MUCH faster (I ran it with 8 sessionIDs in a [1|2|3|..|8] pattern and it took the same time as a single one did, i.e. 8x faster). However, I then need to figure out how to build this pattern dynamically.
Update
I have actually established a working solution which only takes seconds to complete, but it is some messy and inflexible bash code that I have yet to extend to the second case (isolating by days).
I want to extract all lines related to a specific user which for each session include all of the lines up until the next 'M' line is encountered (can be any number of lines).
$ awk '$2=="M"{p=$4==0}p' file
0 M 4 0
0 0 Q 0 10047345 3080290,4098689 50504886,4217515 9848058,1084315 50534229,4217515 50591618,4217515 26242582,2597528 34623075,3279130 68893581,5149883 50628761,4217517 32262001,3142702 35443881,3339757
0 108 C 0 50628761
0 1080 C 0 50628761
1 M 7 0
1 0 Q 0 17143989
As a second task I also want to extract all session lines related to a specific day.
$ awk '$2=="M"{p=$3==7}p' file
1 M 7 0
1 0 Q 0 17143989
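Both one-liners rely on the same single-pass trick: every 'M' line recomputes a print flag from the session header, and every line is printed while the flag is set. The same logic, sketched in Python for illustration (the filename and target value are placeholders):

# Single pass: recompute a flag on each session header ('M' line), print while set.
target_user = "0"    # for day filtering, test fields[2] instead
printing = False
with open("file") as f:
    for line in f:
        fields = line.split()
        if len(fields) > 3 and fields[1] == "M":
            printing = (fields[3] == target_user)
        if printing:
            print(line, end="")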

All possible N choose K WITHOUT recursion

I'm trying to create a function that goes through a row vector and outputs all the combinations of n choose k without recursion.
For example: 3 choose 2 on [a,b,c] outputs [a,b; a,c; b,c]
I found this: How to loop through all the combinations of e.g. 48 choose 5 which shows how to do it for a fixed n choose k and this: https://codereview.stackexchange.com/questions/7001/generating-all-combinations-of-an-array which shows how to get all possible combinations. Using the latter code, I managed to make a very simple and inefficient function in matlab which returned the result:
function [ combi ] = NCK(x,k)
%x - row vector of inputs
%k - number of elements in the combinations
combi = [];
letLen = 2^length(x);
for i = 0:letLen-1
    temp = [0];
    a = 1;
    for j = 0:length(x)-1
        if (bitand(i,2^j))      % bit j of i selects element j+1
            temp(a) = x(j+1);
            a = a+1;
        end
    end
    if (nnz(temp) == k)         % keep only subsets of exactly k elements
        combi = [combi; temp];
    end
end
combi = sortrows(combi);
end
This works well for very small vectors, but I need it to work with vectors of at least length 50. I've found many examples of how to do this recursively, but is there an efficient way to do it without recursion that still handles variable-sized vectors and values of k?
Here's a simple function that will take a permutation of k ones and n-k zeros and return the next combination of nchoosek. It's completely independent of the values of n and k, taking the values directly from the input array.
function [nextc] = nextComb(oldc)
    nextc = [];
    o = find(oldc, 1);                  %// find the first one
    z = find(~oldc(o+1:end), 1) + o;    %// find the first zero *after* the first one
    if ~isempty(z)
        nextc = oldc;
        nextc(1:z-1) = 0;
        nextc(z) = 1;                   %// make the first zero a one
        nextc(1:nnz(oldc(1:z-2))) = 1;  %// move previous ones to the beginning
    else
        nextc = zeros(size(oldc));
        nextc(1:nnz(oldc)) = 1;         %// start over
    end
end
(Note that the else clause is only necessary if you want the combinations to wrap around from the last combination to the first.)
If you call this function with, for example:
A = [1 1 1 1 1 0 1 0 0 1 1]
nextCombination = nextComb(A)
the output will be:
A =
1 1 1 1 1 0 1 0 0 1 1
nextCombination =
1 1 1 1 0 1 1 0 0 1 1
You can then use this as a mask into your alphabet (or whatever elements you want combinations of).
C = ['a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k']
C(find(nextCombination))
ans = abcdegjk
The first combination in this ordering is
1 1 1 1 1 1 1 1 0 0 0
and the last is
0 0 0 1 1 1 1 1 1 1 1
To generate the first combination programmatically,
n = 11; k = 8;
nextCombination = zeros(1,n);
nextCombination(1:k) = 1;
Now you can iterate through the combinations (or however many you're willing to wait for):
for c = 2:nchoosek(n,k) %// start from 2; we already have 1
    nextCombination = nextComb(nextCombination);
    %// do something with the combination...
end
For your example above:
C = ['a' 'b' 'c'];
nextCombination = [1 1 0];
C(find(nextCombination))
for c = 2:nchoosek(3,2)
    nextCombination = nextComb(nextCombination);
    C(find(nextCombination))
end
ans = ab
ans = ac
ans = bc
Note: I've updated the code; I had forgotten to include the line to move all of the 1's that occur prior to the swapped digits to the beginning of the array. The current code (in addition to being corrected above) is on ideone here. Output for 4 choose 2 is:
allCombs =
1 2
1 3
2 3
1 4
2 4
3 4
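The same successor step is easy to express outside MATLAB as well. Here is a minimal Python sketch, a direct translation of nextComb to 0-based indices (illustration only):

def next_comb(mask):
    # Advance a 0/1 mask to the next combination in this ordering; wraps around.
    o = next(i for i, b in enumerate(mask) if b)                          # first one
    z = next((i for i in range(o + 1, len(mask)) if not mask[i]), None)   # first zero after it
    if z is None:                       # wrap around: start over
        k = sum(mask)
        return [1] * k + [0] * (len(mask) - k)
    out = mask[:]
    for i in range(z):                  # zero everything before z
        out[i] = 0
    out[z] = 1                          # make the first zero a one
    for i in range(sum(mask[:z - 1])):  # move the preceding ones to the front
        out[i] = 1
    return out

mask = [1, 1, 0]
for _ in range(3):
    print(''.join(c for c, b in zip("abc", mask) if b))
    mask = next_comb(mask)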

Variable format

I wrote a program to calculate a square finite difference matrix, where you can enter the number of rows (which equals the number of columns); this is stored in the variable matrix. The program works fine:
program fin_diff_matrix
    implicit none
    integer, dimension(:,:), allocatable :: A
    integer :: matrix, i, j
    print *, 'Enter elements:'
    read *, matrix
    allocate(A(matrix,matrix))
    A = 0
    A(1,1) = 2
    A(1,2) = -1
    A(matrix,matrix) = 2
    A(matrix,matrix-1) = -1
    do j = 2, matrix-1
        A(j,j-1) = -1
        A(j,j)   = 2
        A(j,j+1) = -1
    end do
    print *, 'Matrix A: '
    write(*,1) A
1   format(6i10)
end program fin_diff_matrix
For the output, I want the matrix to be formatted to the entered size, e.g. if the user enters 6 rows the output should look like:
2 -1 0 0 0 0
-1 2 -1 0 0 0
0 -1 2 -1 0 0
0 0 -1 2 -1 0
0 0 0 -1 2 -1
0 0 0 0 -1 2
The output of the format should also be variable, e.g. if the user enters 10, the output should also be formatted in 10 columns. Research on the Internet gave the following solution for the format statement with angle brackets:
1 format(<matrix>i10)
If I compile with gfortran in Linux I always get the following error in the terminal:
fin_diff_matrix.f95:37.12:
1 format(<matrix>i10)
1
Error: Unexpected element '<' in format string at (1)
fin_diff_matrix.f95:35.11:
write(*,1) A
1
Error: FORMAT label 1 at (1) not defined
Why doesn't that work, and what is my mistake?
The syntax you are trying to use is non-standard; it works only in some compilers, and I discourage using it.
Also, forget the FORMAT() statements for good; they are obsolete.
You can get your own number into the format string by constructing the string yourself at run time from several parts:
character(80) :: form
form = '(          (i10,1x))'       ! ten blanks after the '(' leave room for the count
write(form(2:11),'(i10)') matrix
write(*,form) A
You can also write your matrix in a loop, one row per write; then you can use an arbitrarily large repeat count, or a * repeat count in Fortran 2008:
do i = 1, matrix
    write(*,'(999(i10,1x))') A(:,i)
end do
do i = 1, matrix
    write(*,'(*(i10,1x))') A(:,i)
end do
Just check if I did not transpose the matrix inadvertently.
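The same run-time idea, deriving the format from the data and then applying it, can be sketched outside Fortran as well. A small Python illustration (the names and the 6x6 size are made up for the demo):

def print_matrix(a):
    # Build the format at run time: one 10-wide integer field per column.
    fmt = "%10d" * len(a[0])
    for row in a:
        print(fmt % tuple(row))

# Fill a 6x6 finite-difference matrix like the program above, then print it.
n = 6
a = [[0] * n for _ in range(n)]
for j in range(n):
    a[j][j] = 2
    if j > 0:
        a[j][j - 1] = -1
    if j < n - 1:
        a[j][j + 1] = -1
print_matrix(a)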

Bash/Nawk whitespace problems

I have 100 datafiles, each with 1000 rows, and they all look something like this:
0 0 0 0
1 0 1 0
2 0 1 -1
3 0 1 -2
4 1 1 -2
5 1 1 -3
6 1 0 -3
7 2 0 -3
8 2 0 -4
9 3 0 -4
10 4 0 -4
.
.
.
999 1 47 -21
1000 2 47 -21
I have developed a script which is supposed to take the square of each value in columns 2, 3 and 4, then sum them and take the square root.
Like so:
temp = ($t1*$t1) + ($t2*$t2) + ($t3*$t3)
calc = $calc + sqrt ($temp)
It then calculates the square of that value, and averages these numbers over every data file to output the average "calc" and the average "fluc" for each row.
The meaning of these numbers is this:
The first number is the step number, and the next three are coordinates on the x, y and z axes respectively. I am trying to find the distance the "steps" have taken me from the origin; this is calculated with the formula r = sqrt(x^2 + y^2 + z^2). Next I need the fluctuation of r, which is calculated as f = r^4, i.e. f = (r^2)^2.
These must be averages over the 100 data files, which leads me to:
r = r + sqrt(x^2 + y^2 + z^2)
avg = r/s
and similarly for f, where s is the number of data files read, which I figure out using sum=$(ls -l *.data | wc -l).
Finally, my last calculation is the deviation between the expected r and the average r, which is calculated as stddev = sqrt(fluc - (r^2)^2) outside of the loop using final values.
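Collected as formulas (a restatement of the prose above, for a fixed step index, summing over the s data files):

\bar{r} = \frac{1}{s} \sum_{\text{files}} \sqrt{x^2 + y^2 + z^2}, \qquad
\bar{f} = \frac{1}{s} \sum_{\text{files}} \left( x^2 + y^2 + z^2 \right)^2, \qquad
\text{stddev} = \sqrt{ \bar{f} - \left( \bar{r}^2 \right)^2 }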
The script I created is:
#!/bin/bash
sum=$(ls -l *.data | wc -l)
paste -d"\t" *.data | nawk -v s="$sum" '{
for(i=0;i<=s-1;i++)
{
t1 = 2+(i*4)
t2 = 3+(i*4)
t3 = 4+(i*4)
temp = ($t1*$t1) + ($t2*$t2) + ($t3*$t3)
calc = $calc + sqrt ($temp)
fluc = $fluc + ($calc*$calc)
}
stddev = sqrt(($calc^2) - ($fluc))
print $1" "calc/s" "fluc/s" "stddev
temp=0
calc=0
stddev=0
}'
Unfortunately, part way through I receive an error:
nawk: cmd. line:9: (FILENAME=- FNR=3) fatal: attempt to access field -1
I am not experienced enough with awk to be able to figure out exactly where I am going wrong, could someone point me in the right direction or give me a better script?
The expected output is one file with:
0 0 0 0
1 (calc for all 1's) (fluc for all 1's) (stddev for all 1's)
2 (calc for all 2's) (fluc for all 2's) (stddev for all 2's)
.
.
.
The following script should do what you want. The only thing that might not work yet is the choice of delimiters. In your original script you seem to have tabs. My solution assumes spaces. But changing that should not be a problem.
It simply pipes all the files sequentially into nawk without counting them first; the count is not actually required. Instead of trying to keep track of positions in the file, it uses arrays to store separate statistical data for each step. At the end it iterates over all step indexes found and outputs them. Since the iteration is not sorted, there is another pipe into a Unix sort call which handles this.
#!/bin/bash
# pipe the data of all files into the nawk processor
cat *.data | nawk '
BEGIN {
    FS = " "    # set the delimiter for the columns
}
{
    step = $1                          # step is in column 1
    temp = $2*$2 + $3*$3 + $4*$4
    # use arrays indexed by step to store data
    calc[step] = calc[step] + sqrt(temp)
    fluc[step] = fluc[step] + calc[step]*calc[step]
    count[step] = count[step] + 1      # count the number of samples seen for a step
}
END {
    # iterate over all existing steps (this is not sorted!)
    for (i in count) {
        stddev = sqrt((calc[i] * calc[i]) + (fluc[i] * fluc[i]))
        print i" "calc[i]/count[i]" "fluc[i]/count[i]" "stddev
    }
}' | sort -n -k 1    # that is why we sort here: by the first column ("-k 1") and numerically ("-n")
EDIT
As suggested by @edmorton, awk can take care of loading the files itself. The following enhanced version removes the call to cat and instead passes the file pattern as a parameter to nawk. Also, as suggested by @NictraSavios, the new version introduces special handling for the output of the statistics of the last step. Note that the gathering of the statistics is still done for all steps. It is a little difficult to suppress this while reading the data, since at that point we don't know yet which step will be the last one. Although this could be done with some extra effort, you would probably lose a lot of the robustness of the data handling, since right now the script makes no assumptions about:
the number of files provided,
the order of the files processed,
the number of steps in each file,
the order of the steps in a file,
the completeness of steps as a range without "holes".
Enhanced script:
#!/bin/bash
nawk '
BEGIN {
    FS = " "      # set the delimiter for the columns (not really required; space is the default)
    maxstep = -1
}
{
    step = $1                          # step is in column 1
    temp = $2*$2 + $3*$3 + $4*$4
    # remember the maximum step for the selective output
    if (step > maxstep)
        maxstep = step
    # use arrays indexed by step to store data
    calc[step] = calc[step] + sqrt(temp)
    fluc[step] = fluc[step] + calc[step]*calc[step]
    count[step] = count[step] + 1      # count the number of samples seen for a step
}
END {
    # iterate over all existing steps (this is not sorted!)
    for (i in count) {
        stddev = sqrt((calc[i] * calc[i]) + (fluc[i] * fluc[i]))
        if (i == maxstep)
            # handle the last step in a special way
            print i" "calc[i]/count[i]" "fluc[i]/count[i]" "stddev
        else
            # this is the normal handling
            print i" "calc[i]/count[i]
    }
}' *.data | sort -n -k 1    # sort by the first column ("-k 1") and numerically ("-n")
You could also use:
awk -f c.awk *.data
where c.awk is
{
    j = FNR
    temp = $2*$2 + $3*$3 + $4*$4
    calc[j] = calc[j] + sqrt(temp)
    fluc[j] = fluc[j] + calc[j]*calc[j]
}
END {
    N = ARGIND    # gawk-specific: index of the last input file, i.e. the file count
    for (i = 1; i <= FNR; i++) {
        stdev = sqrt(fluc[i] - calc[i]*calc[i])
        print i-1, calc[i]/N, fluc[i]/N, stdev
    }
}
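The per-step array accumulation is the core idea in all of these variants. For illustration, here is the same structure as a Python sketch (it mirrors the c.awk version above, with a guard against a negative radicand; the *.data pattern is taken from the question):

import glob
import math

calc = {}   # per-step running sum of r = sqrt(x^2 + y^2 + z^2)
fluc = {}   # per-step accumulator, updated as in the awk scripts above
files = glob.glob("*.data")
for name in files:
    with open(name) as f:
        for line in f:
            step, x, y, z = map(float, line.split())
            r = math.sqrt(x*x + y*y + z*z)
            calc[step] = calc.get(step, 0.0) + r
            fluc[step] = fluc.get(step, 0.0) + calc[step] * calc[step]

n = len(files)
for step in sorted(calc):
    stdev = math.sqrt(max(fluc[step] - calc[step] * calc[step], 0.0))
    print(int(step), calc[step] / n, fluc[step] / n, stdev)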

Form a Matrix From a Large Text File Quickly

Hi, I am struggling to read data from a file quickly enough (it was left running for 4 hrs, then crashed); there must be a simpler way.
The text file looks similar like this:
From To
1 5
3 2
2 1
4 3
From this I want to form a matrix with a 1 at each corresponding [m,n] position.
The current code is:
function [z] = reed (A)
[m,n] = size(A);
i = 1;
while (i <= n)
    z(A(1,i),A(2,i)) = 1;
    i = i+1;
end
Which output the following matrix, z:
z =
0 0 0 0 1
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
My actual file has 280,000,000 links to and from, so this code is too slow for a file of that size. Does anybody know a much faster way to do this in MATLAB?
Thanks.
You can do something along the lines of the following:
>> A = zeros(4,5);
>> B = importdata('testcase.txt');
>> A(sub2ind(size(A),B.data(:,1),B.data(:,2))) = 1;
My test case, 'testcase.txt', contains your sample data:
From To
1 5
3 2
2 1
4 3
The result would be:
>> A
A =
0 0 0 0 1
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
EDIT - 1
After taking a look at your data, it seems that even if you modify this code appropriately, you may not have enough memory to execute it as the matrix A would become too large.
As such, you can use sparse matrices to achieve the same as given below:
>> B = importdata('web-Stanford.txt');
>> A = sparse(B.data(:,1),B.data(:,2),1,max(max(B.data)),max(max(B.data)));
This would be the approach I'd recommend as your A matrix will have a size of [281903,281903] which would usually be too large to handle due to memory constraints. A sparse matrix on the other hand, maintains only those matrix entries which are non-zero, thus saving on a lot of space. In most cases, you can use sparse matrices more-or-less as you use normal matrices.
More information about the sparse command is given here.
EDIT - 2
I'm not sure why it isn't working for you. Here's a screenshot of how I did it in case that helps:
EDIT - 3
It seems that you're getting a double matrix in B while I'm getting a struct. I'm not sure why this is happening; I can only speculate that you deleted the header lines from the input file before you used importdata.
Basically it's just that my B.data is the same as your B. As such, you should be able to use the following instead:
>> A = sparse(B(:,1),B(:,2),1,max(max(B)),max(max(B)));
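For comparison outside MATLAB, the same sparse construction can be sketched in Python with scipy (illustration only; the filename and the header-skipping line come from the sample above):

import numpy as np
from scipy.sparse import csr_matrix

# Read the two-column edge list, skipping the "From To" header line.
edges = np.loadtxt("testcase.txt", dtype=np.int64, skiprows=1)
n = int(edges.max())
# One entry of 1 per (from, to) pair; the 1-based ids become 0-based indices.
A = csr_matrix((np.ones(len(edges)), (edges[:, 0] - 1, edges[:, 1] - 1)),
               shape=(n, n))
print(A.toarray())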
