Speed up code to compare fields in a struct - performance

I have a struct array Trajectories with the fields uniqueDate, dateAll, and label. I want to compare the fields uniqueDate and dateAll and, where there is a correspondence, save into label a value from another struct.
I have written this code:
for k=1:nCols
    for j=1:size(Trajectories(1,k).dateAll,1)
        for i=1:size(Trajectories(1,k).uniqueDate,1)
            if (~isempty(s(1,k).places)) && (Trajectories(1,k).dateAll(j,1)==Trajectories(1,k).uniqueDate(i,1)) && (Trajectories(1,k).dateAll(j,2)==Trajectories(1,k).uniqueDate(i,2)) && (Trajectories(1,k).dateAll(j,3)==Trajectories(1,k).uniqueDate(i,3))
                for z=1:24
                    if (Trajectories(1,k).dateAll(j,4)==z) && (size(s(1,k).places.all,2)>=size(Trajectories(1,k).uniqueDate,1))
                        Trajectories(1,k).label(j)=s(1,k).places.all(z,i);
                    elseif (Trajectories(1,k).dateAll(j,4)==z) && (size(s(1,k).places.all,2)<size(Trajectories(1,k).uniqueDate,1))
                        for l=1:size(s(1,k).places.all,2)
                            Trajectories(1,k).label(l)=s(1,k).places.all(z,l);
                        end
                    end
                end
            end
        end
    end
end
E.g.
Trajectories(1,4).dateAll=[ ...
      1 2004 8 1 14  1 15 0 0 0  1 42 13   2; ...
    596 2004 8 1 16 20 14 0 0 0  1 29 12 NaN; ...
    674 2004 8 1 18 26 11 0 0 0  1 20 38   1; ...
    674 2004 8 2 10  7 40 0 0 0 14 26  5   3; ...
    674 2004 8 2 11  3 29 0 0 0  1 54  3   3; ...
    631 2004 8 2 11 57 56 0 0 0  0 30  8   2; ...
      1 2004 8 2 12  4 35 0 0 0  1 53 21   2; ...
    631 2004 8 2 12 52 58 0 0 0  0 20 36   2; ...
    631 2004 8 2 13  5  3 0 0 0  1 49 40   2; ...
    631 2004 8 2 14  0 20 0 0 0  1 56 12   2; ...
    631 2004 8 2 15  2  0 0 0 0  1 57 39   2; ...
    631 2004 8 2 16  1  4 0 0 0  1 55 53   2; ...
      1 2004 8 2 17  9 15 0 0 0  1 48 41   2];
Trajectories(1,4).uniqueDate= [2004 8 1;2004 8 2;2004 8 3;2004 8 4];
It runs, but it's very, very slow. How can I modify it to speed it up?

Let's work from the inside out and see where it gets us.
Step 1: Simplify your comparison condition:
if (~isempty(s(1,k).places))&&(Trajectories(1,k).dateAll(j,1)==Trajectories(1,k).uniqueDate(i,1))&&(Trajectories(1,k).dateAll(j,2)==Trajectories(1,k).uniqueDate(i,2))&&(Trajectories(1,k).dateAll(j,3)==Trajectories(1,k).uniqueDate(i,3))
becomes
if (~isempty(s(1,k).places)) && all( Trajectories(1,k).dateAll(j,1:3)==Trajectories(1,k).uniqueDate(i,1:3) )
Then we want to remove this from a for-loop. The "intersect" function is useful here:
[ia i1 i2]=intersect(Trajectories(1,k).dateAll(:,1:3),Trajectories(1,k).uniqueDate(:,1:3),'rows');
We now have a vector i1 of indices into dateAll for the rows that also appear in uniqueDate (strictly, the first occurrence of each shared row, since intersect de-duplicates).
Now we can remove the loop comparing z using a similar approach:
[iz iz1 iz2] = intersect(Trajectories(1,k).dateAll(i1,4),1:24);
We have to be careful about our indices here, using a subset of a subset.
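As a standalone toy illustration of that composed indexing (values made up, not from the question's data):
a   = [10 20 30 40 50];
i1  = [2 4 5];     % first subset: positions selected from a
iz1 = [1 3];       % second subset: positions within i1
a(i1(iz1))         % ans = [20 50]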
Putting the two intersect calls together simplifies the code to:
for k=1:nCols
    if isempty(s(1,k).places)
        continue; % skip to the next value of k, no need to do the rest of the comparison
    end
    [ia i1 i2]=intersect(Trajectories(1,k).dateAll(:,1:3),Trajectories(1,k).uniqueDate(:,1:3),'rows');
    [iz iz1 iz2] = intersect(Trajectories(1,k).dateAll(i1,4),1:24);
    usescalarlabel = (size(s(1,k).places.all,2)>=size(Trajectories(1,k).uniqueDate,1));
    if (usescalarlabel)
        Trajectories(1,k).label(i1(iz1)) = s(1,k).places.all(iz,i2(iz1));
    else
        % you will need to check this: I think here you were needlessly repeating this step for every match
        Trajectories(1,k).label(i1(iz1)) = s(1,k).places.all(iz,:);
    end
end
But wait! That z loop is exactly the same as using indexing. So we don't need that second intersect after all:
for k=1:nCols
    if isempty(s(1,k).places)
        continue; % skip to the next value of k, no need to do the rest of the comparison
    end
    [ia i1 i2]=intersect(Trajectories(1,k).dateAll(:,1:3),Trajectories(1,k).uniqueDate(:,1:3),'rows');
    usescalarlabel = (size(s(1,k).places.all,2)>=size(Trajectories(1,k).uniqueDate,1));
    label_indices = Trajectories(1,k).dateAll(i1,4);
    if (usescalarlabel)
        Trajectories(1,k).label(label_indices) = s(1,k).places.all(label_indices,i2);
    else
        % you will need to check this: I think here you were needlessly repeating this step for every match
        Trajectories(1,k).label(label_indices) = s(1,k).places.all(label_indices,:);
    end
end
You'll need to check the indexing in this - I'm sure I've made a mistake somewhere without having data to test against, but that should give you an idea of how to proceed in removing the loops and using vector expressions instead. Without seeing the data, that's as far as I can optimise. You may be able to go further if you can reformat your data into a set of 3-D matrices or cell arrays instead of structs.
I am suspicious of your condition which I have called "usescalarlabel" - it seems like you are mixing two data types. I would also strongly recommend separating the dateAll matrix into separate "date" and "data" matrices, as columns 4 onward don't seem to be dates. Also, the example you copy/pasted seems to have an extra value in column 1? In that case you'll need to compare Trajectories(1,k).dateAll(:,2:4) instead of Trajectories(1,k).dateAll(:,1:3).
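If that is the case, here is a hedged sketch against the example data from the question (untested; it assumes column 1 of dateAll is an extra id-like value, so the date sits in columns 2:4). Note that ismember(...,'rows') flags every matching row, whereas intersect returns only the first occurrence of each shared row:
[tf, loc] = ismember(Trajectories(1,4).dateAll(:,2:4), Trajectories(1,4).uniqueDate, 'rows');
matchRows = find(tf);   % all rows of dateAll whose date appears in uniqueDate
matchIdx  = loc(tf);    % for each matching row, the corresponding row of uniqueDate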
Good luck.

Related

How do I write a DXF code for a POLYLINE?

Hello, I'm trying to create a C++ program that calculates a function's values over a given range; the program then creates a DXF file so the result can be graphed.
The issue I'm having is with the DXF part. This is the output my C++ program generates, but AutoCAD seems unable to read it. Any insights on the issue will be much appreciated.
0
SECTION
2
ENTITIES
0
POLYLINE
8
0
62
1
66
1
70
8
0
VERTEX
8
0
70
32
10
1
20
2
30
0
0
VERTEX
8
0
70
32
10
1.2
20
2.13688
30
0
0
VERTEX
8
0
70
32
10
1.4
20
2.28024
30
0
0
VERTEX
8
0
70
32
10
1.6
20
2.42929
30
0
0
VERTEX
8
0
70
32
10
1.8
20
2.58329
30
0
0
VERTEX
8
0
70
32
10
2
20
2.74166
30
0
0
91
0
0
SEQEND
0
ENDSEC
0
EOF
There is an error in the last VERTEX:
0
VERTEX
8
0
70
32
10
2
20
2.74166
30
0
0 <---- this 0 is one too many; it starts a structural group tag (0, 91)
91
0
0
SEQEND
0
ENDSEC
0
EOF
If you have any information about what group code 91 (vertex identifier) is for, let me know; I am very interested.
The issue I was having is that I was using the DXF group codes for a LWPOLYLINE when I should have been using those for a POLYLINE. The difference is subtle, but if you are having this issue, backtrack through the group codes one by one and make sure all of them belong to the same entity. I will share the code that finally produced output in AutoCAD 2018 (keep in mind the changes in the DXF format across AutoCAD versions, depending on your case):
0
SECTION
2
ENTITIES
0
POLYLINE
8
0
62
1
66
1
70
8
0
VERTEX
8
0
70
32
10
0
20
0
30
0
0
SEQEND
0
ENDSEC
0
EOF
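For reference, a minimal C++ sketch of an emitter for exactly this structure (my own illustration, untested in AutoCAD; the function name is made up and the layer/colour/flag values are carried over from the output above):
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Writes the POLYLINE/VERTEX/SEQEND structure shown above: one group code
// per line, followed by its value on the next line.
void writePolylineDxf(const std::string& path,
                      const std::vector<std::pair<double, double>>& pts)
{
    std::ofstream out(path);
    out << "0\nSECTION\n2\nENTITIES\n";
    out << "0\nPOLYLINE\n8\n0\n62\n1\n66\n1\n70\n8\n"; // layer 0, colour 1, vertices-follow, 3D-polyline flag
    for (const auto& p : pts) {
        out << "0\nVERTEX\n8\n0\n70\n32\n"; // 3D-polyline vertex flag
        out << "10\n" << p.first  << "\n";  // x
        out << "20\n" << p.second << "\n";  // y
        out << "30\n0\n";                   // z
    }
    out << "0\nSEQEND\n0\nENDSEC\n0\nEOF\n";
}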

The groupBy function didn't work on a two-dimensional array in Laravel 5.5

id p_id approve m_approve
1 75 1 0
2 74 1 1
3 73 1 1
4 72 1 1
5 75 1 1
6 73 0 1
7 71 1 0
8 70 1 1
9 69 0 1
10 75 0 0
11 75 0 0
12 75 0 0
13 75 1 0
14 75 1 0
15 75 0 1
$result = DB::table('a16s_likes')
    ->select('id', 'p_id', 'approve', 'm_approve')
    ->get()               // becomes a Collection
    ->groupBy('p_id')
    ->toArray();          // or ->all()
echo '<pre>';
print_r($result);
I got the correct single-level (one-dimensional) array. But when I use
->groupBy('p_id','approve')
->all();
I get the same single-level array, not a two-level one. How can I get two groups, p_id(75)-approve(0) and p_id(75)-approve(1), and take just the last 2 rows of each?
I fixed the code to
->groupBy(['p_id','m_approve'])
and got the two-level grouping I wanted.
According to the docs:
Multiple grouping criteria may be passed as an array. Each array element will be applied to the corresponding level within a multi-dimensional array.
This means your code should be ->groupBy(['p_id','approve']).
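As a hedged sketch (untested against a 5.5 codebase; the two map() calls and take(-2) are standard Collection methods), this combines the two-level grouping with keeping only the last two rows of each subgroup:
$result = DB::table('a16s_likes')
    ->select('id', 'p_id', 'approve', 'm_approve')
    ->get()
    ->groupBy(['p_id', 'approve'])              // level 1: p_id, level 2: approve
    ->map(function ($byApprove) {
        return $byApprove->map(function ($rows) {
            return $rows->take(-2);             // a negative count takes from the end
        });
    })
    ->toArray();
// e.g. $result[75][0] and $result[75][1] are the approve=0 / approve=1 groups for p_id 75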

How can you improve computation time when predicting KNN Imputation?

I feel like the run time is extremely slow for my data set. This is the code:
library(caret)
library(data.table)
knnImputeValues <- preProcess(mainData[trainingRows, imputeColumns], method = c("zv", "knnImpute"))
knnTransformed <- predict(knnImputeValues, mainData[ 1:1000, imputeColumns])
The preProcess call that builds knnImputeValues runs fairly quickly; however, the predict function takes a tremendous amount of time. When I ran it on a subset of the data, this was the result:
testtime <- system.time(knnTransformed <- predict(knnImputeValues, mainData[1:15000, imputeColumns]))
testtime
   user  system elapsed
 969.78   38.70 1010.72
Additionally, it should be noted that caret's preProcess uses RANN.
Now my full dataset is:
str(mainData[ , imputeColumns])
'data.frame': 1809032 obs. of 16 variables:
$ V1: int 3 5 5 4 4 4 3 4 3 3 ...
$ V2: Factor w/ 3 levels "1000000","1500000",..: 1 1 3 1 1 1 1 3 1 1 ...
$ V3: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ V4: int 2 5 5 12 4 5 11 8 7 8 ...
$ V5: int 2 0 0 2 0 0 1 3 2 8 ...
$ V6: int 648 489 489 472 472 472 497 642 696 696 ...
$ V7: Factor w/ 4 levels "","N","U","Y": 4 1 1 1 1 1 1 1 1 1 ...
$ V8: int 0 0 0 0 0 0 0 1 1 1 ...
$ V9: num 0 0 0 0 0 ...
$ V10: Factor w/ 56 levels "1","2","3","4",..: 45 19 19 19 19 19 19 46 46 46 ...
$ V11: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ V12: num 2 5 5 12 4 5 11 8 7 8 ...
$ V13: num 2 0 0 2 0 0 1 3 2 8 ...
$ V14: Factor w/ 4 levels "1","2","3","4": 2 2 2 2 2 2 2 2 3 3 ...
$ V15: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 2 ...
$ V16: num 657 756 756 756 756 ...
So is there something I'm doing wrong, or is this typical of how long this will take to run? If you extrapolate back-of-the-envelope (which I know isn't entirely accurate), you'd get what, 33 days?
Also, it looks like system time is very low and user time is very high; is that normal?
My computer is a laptop with an Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz.
Additionally, would this improve the runtime of the predict function?
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
I tried it, and it didn't seem to make a difference, other than all the processors looking more active in my task manager.
FOCUSED QUESTION: I'm using the caret package to do KNN imputation on 1.8 million rows. The way I'm currently doing it will take over a month to run; how do I write this so that it runs much faster (if possible)?
Thank you for any help provided. The answer might very well be "that's how long it takes, don't bother"; I just want to rule out any possible mistakes.
You can speed this up via the imputation package and the use of canopies; it can be installed from GitHub:
Sys.setenv("PKG_CXXFLAGS"="-std=c++0x")
devtools::install_github("alexwhitworth/imputation")
Canopies use a cheap distance metric--in this case distance from the data mean vector--to get approximate neighbors. In general, we wish to keep the canopies each sized < 100k so for 1.8M rows, we'll use 20 canopies:
library("imputation")
to_impute <- mainData[trainingRows, imputeColumns] ## OP undefined
imputed <- kNN_impute(to_impute, k= 10, q= 2, verbose= TRUE,
parallel= TRUE, n_canopies= 20)
NOTE:
The imputation package requires numeric data inputs. You have several factor variables in your str output; they will cause this to fail.
You'll also get some mean-vector imputation if you have fully missing rows.
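One hedged workaround for the factor columns (my sketch, not part of the imputation package): one-hot encode them with model.matrix so kNN_impute gets an all-numeric input, while keeping the NA-containing rows:
# na.pass stops model.matrix from silently dropping rows that contain NAs
old_na <- options(na.action = "na.pass")
to_impute_num <- model.matrix(~ . - 1, data = to_impute)  # factors -> 0/1 dummy columns
options(old_na)
# imputed dummy values will be fractional; round or threshold them afterwards if needed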
# note this example data is too small for canopies to be useful
# meant solely to illustrate
set.seed(2143L)
x1 <- matrix(rnorm(1000), 100, 10)
x1[sample(1:1000, size= 50, replace= FALSE)] <- NA
x_imp <- kNN_impute(x1, k=5, q=2, n_canopies= 10)
sum(is.na(x_imp[[1]])) # 0
# with fully missing rows
x2 <- x1; x2[5,] <- NA
x_imp <- kNN_impute(x2, k=5, q=2, n_canopies= 10)
[1] "Computing canopies kNN solution provided within canopies"
[1] "Canopies complete... calculating kNN."
row(s) 1 are entirely missing.
These row(s)' values will be imputed to column means.
Warning message:
In FUN(X[[i]], ...) :
Rows with entirely missing values imputed to column means.

Dynamic Programming - Two spies at the river

I think this is a very complicated dynamic programming problem.
Two spies each have a secret number in [1..m]. To exchange the numbers they agree to meet at the river and "innocently" take turns throwing stones: from a pile of n=26 identical stones, each spy in turn throws at least one stone into the river.
The only information is in the number of stones each throws in each turn. How large can m be so that they are sure they can complete the exchange?
Develop a recursive formula to count. Here is the start of the table; complete it to n=26. (You should not expect a closed form.)
n 1 2 3 4 5 6 7 8 9 10 11 12
m 1 1 1 2 2 3 4 6 8 12 16 23
Here are some hints from our professor: I suggest changing the problem to making the following table: let R(n,m) be the range of numbers [1..R(n,m)] that A can indicate to B if they start with n stones, and both know that A also has to receive a number in [1..m] from B.
For example, if A needs no more information, R(n,1) can be computed by considering how many stones A could throw (one to n); then B throws 1 (if any remain) and A gets to decide again. The base cases are R(0,1) = R(1,1) = 1, and you can write a recursive rule if you are careful at the boundaries. (You should find the Fibonacci numbers for R(n,1).)
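One way to unpack that hint (my sketch, worth verifying): if A throws all n stones, the game ends and exactly one value is signalled; if A throws j < n stones, B throws one, and A faces the same situation with n - j - 1 stones. Summing over A's choices gives
R(n,1) = 1 + sum_{k=0}^{n-2} R(k,1),   with R(0,1) = R(1,1) = 1,
and subtracting the same identity for n-1 collapses it to R(n,1) = R(n-1,1) + R(n-2,1), the Fibonacci recurrence - matching the m=1 column of the table below (1, 1, 2, 3, 5, 8, 13, 21, 34, 55).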
If A needs information, then B has to send it by his or her choices, so things are a little more complicated. Here is the start of the table:
n\m   1   2   3   4   5
0     1   0   0   0   0
1     1   0   0   0   0
2     2   0   0   0   0
3     3   1   0   0   0
4     5   2   1   0   0
5     8   4   2   1   1
6    13   7   4   3   2
7    21  12   8   6   4
8    34  20  15  11   8
9    55  33  27  19  16
From the R(n,m) table, how would you recover the entries of the earlier table (the table showing m as a function of n)?
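One consistent reading of the two tables (my guess, worth checking): m(n) is the largest m with R(n,m) >= m, i.e. the largest exchange size for which A can still indicate m values while also receiving one of m from B. For example, at n = 7: R(7,4) = 6 >= 4 but R(7,5) = 4 < 5, so m(7) = 4, matching the earlier table.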

GAMS, matrix direct assignment

I want to assign values to a 3-D table in GAMS, but it doesn't seem to work the way it does in MATLAB. Any ideas? The code is as follows, and the problem is in the last few lines:
Sets
n nodes / Sto , Lon , Par , Ber , War , Mad , Rom /
i scenarios / 1 * 4 /
k capacity level / L, N, H / ;
alias(n,m);
Table balance(n,i) traffic balance for different nodes
1 2 3 4
Sto 50 50 -50 -50
Lon -40 40 -40 40
Par 0 0 0 0
Ber 0 0 0 0
War 40 -40 40 -40
Mad 0 0 0 0
Rom -50 -50 50 50 ;
Scalar r fluctuation rate of the capacity level
/0.15/;
Parameter p(k) probability of each level
/ L 0.25
N 0.5
H 0.25 / ;
Table nor_cap(n,m) Normal capacity level from n to m
Sto Lon Par Ber War Mad Rom
Sto 0 11 14 25 30 0 0
Lon 11 0 21 0 0 14 0
Par 14 21 0 22 0 31 19
Ber 25 0 22 0 26 0 18
War 30 0 0 26 0 18 22
Mad 0 14 31 0 18 0 15
Rom 0 0 19 18 22 15 0 ;
Table max_cap(n,m,k) capacity level under each k
max_cap(n,m,'N')=nor_cap(n,m)
max_cap(n,m,'L')=nor_cap(n,m)*(1-r)
max_cap(n,m,'H')=nor_cap(n,m)*(1+r);
The final assignment to a 3-D matrix should be done with a PARAMETER as opposed to a TABLE. In general I would also note that TABLE is very restrictive (two-dimensional, text input inside the code). You might want to consider $GDXIN (or EXECUTE_LOAD) and some of the GAMS utilities for loading xls or csv files.
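As a minimal sketch of that fix applied to the last block of your code (note that each GAMS assignment statement also needs its own terminating semicolon):
Parameter max_cap(n,m,k) capacity level under each k ;
max_cap(n,m,'N') = nor_cap(n,m) ;
max_cap(n,m,'L') = nor_cap(n,m)*(1-r) ;
max_cap(n,m,'H') = nor_cap(n,m)*(1+r) ;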
As a user of both MATLAB and GAMS, I would note that GAMS requires "indices" for every array, but otherwise the two can be quite similar. In your case max_cap(n,m,k) would be something like the maximum capacity from from_city to to_city under each capacity-level scenario. Your matrix needs to be declared as a PARAMETER, which can be an n-dimensional (indexed) matrix, or even a SCALAR.
Also, try the GAMS mailing list if you really need an answer quickly, the number of proficient GAMS users globally can't be more than a few thousand, so it might be hard to find a quick answer on StackOverflow - awesome as it is for the more common languages.
