How can you improve computation time when predicting with KNN imputation?

I feel like my run time is extremely slow for my data set; this is the code:
library(caret)
library(data.table)
knnImputeValues <- preProcess(mainData[trainingRows, imputeColumns], method = c("zv", "knnImpute"))
knnTransformed <- predict(knnImputeValues, mainData[ 1:1000, imputeColumns])
The preProcess call that builds knnImputeValues runs fairly quickly; however, the predict function takes a tremendous amount of time. When I ran it on a subset of the data, this was the result:
testtime <- system.time(knnTransformed <- predict(knnImputeValues, mainData[1:15000, imputeColumns]))
testtime
   user  system elapsed
 969.78   38.70 1010.72
Additionally, it should be noted that caret's preProcess uses "RANN".
Now my full dataset is:
str(mainData[ , imputeColumns])
'data.frame': 1809032 obs. of 16 variables:
$ V1: int 3 5 5 4 4 4 3 4 3 3 ...
$ V2: Factor w/ 3 levels "1000000","1500000",..: 1 1 3 1 1 1 1 3 1 1 ...
$ V3: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ V4: int 2 5 5 12 4 5 11 8 7 8 ...
$ V5: int 2 0 0 2 0 0 1 3 2 8 ...
$ V6: int 648 489 489 472 472 472 497 642 696 696 ...
$ V7: Factor w/ 4 levels "","N","U","Y": 4 1 1 1 1 1 1 1 1 1 ...
$ V8: int 0 0 0 0 0 0 0 1 1 1 ...
$ V9: num 0 0 0 0 0 ...
$ V10: Factor w/ 56 levels "1","2","3","4",..: 45 19 19 19 19 19 19 46 46 46 ...
$ V11: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ V12: num 2 5 5 12 4 5 11 8 7 8 ...
$ V13: num 2 0 0 2 0 0 1 3 2 8 ...
$ V14: Factor w/ 4 levels "1","2","3","4": 2 2 2 2 2 2 2 2 3 3 ...
$ V15: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 2 ...
$ V16: num 657 756 756 756 756 ...
So is there something I'm doing wrong, or is this typical of how long it will take to run? If you extrapolate back-of-the-envelope (which I know isn't entirely accurate), you'd get what, 33 days?
Also, it looks like system time is very low and user time is very high; is that normal?
My computer is a laptop with an Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz.
Additionally, would this improve the runtime of the predict function?
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
I tried it, and it didn't seem to make a difference, other than all the processors looking more active in my task manager.
FOCUSED QUESTION: I'm using the caret package to do KNN imputation on 1.8 million rows. The way I'm currently doing it will take over a month to run. How do I write this so that it runs in a much shorter time (if possible)?
Thank you for any help provided. The answer might very well be "that's how long it takes, don't bother"; I just want to rule out any possible mistakes.

You can speed this up via the imputation package and its use of canopies. It can be installed from GitHub:
Sys.setenv("PKG_CXXFLAGS"="-std=c++0x")
devtools::install_github("alexwhitworth/imputation")
Canopies use a cheap distance metric--in this case, distance from the data mean vector--to get approximate neighbors. In general, we wish to keep each canopy sized < 100k rows, so for 1.8M rows we'll use 20 canopies:
library("imputation")
to_impute <- mainData[trainingRows, imputeColumns] ## OP undefined
imputed <- kNN_impute(to_impute, k= 10, q= 2, verbose= TRUE,
parallel= TRUE, n_canopies= 20)
NOTE:
The imputation package requires numeric data inputs. You have several factor variables in your str output; they will cause this to fail.
You'll also get some mean-vector imputation if you have fully missing rows.
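If needed, here is a minimal sketch (my own, not from the package docs) of one way to get an all-numeric input first:
## data.matrix() replaces each factor column with its integer codes and
## preserves NAs. Note that integer codes impose an artificial ordering on
## unordered factors, so dummy coding may suit some of your columns better.
to_impute_numeric <- data.matrix(to_impute)
imputed <- kNN_impute(to_impute_numeric, k= 10, q= 2, verbose= TRUE,
parallel= TRUE, n_canopies= 20)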
# note this example data is too small for canopies to be useful
# meant solely to illustrate
set.seed(2143L)
x1 <- matrix(rnorm(1000), 100, 10)
x1[sample(1:1000, size= 50, replace= FALSE)] <- NA
x_imp <- kNN_impute(x1, k=5, q=2, n_canopies= 10)
sum(is.na(x_imp[[1]])) # 0
# with fully missing rows
x2 <- x1; x2[5,] <- NA
x_imp <- kNN_impute(x2, k=5, q=2, n_canopies= 10)
[1] "Computing canopies kNN solution provided within canopies"
[1] "Canopies complete... calculating kNN."
row(s) 1 are entirely missing.
These row(s)' values will be imputed to column means.
Warning message:
In FUN(X[[i]], ...) :
Rows with entirely missing values imputed to column means.

Related

Is a 1D or 2D array more computationally efficient in Lua for a large matrix?

In Lua specifically, which is less computationally expensive: a matrix in which the item at (row, column) is located at matrix[row][column], or one in which it is located at matrix[row + numberOfRows * column]?
Assume that these items will be read and written to a lot, and assume that the matrix is large at about 1000 by 2000 items.
I mainly care about efficiency in the moment rather than overhead.
As shown below, matrix[row][column] uses one less VM instruction than matrix[row + numberOfRows * column]. However, it is not clear whether one GETTABLE is faster than MUL+ADD.
The only real answer is: measure both alternatives.
$ cat 1
local matrix,row,numberOfRows,column
return matrix[row][column]
$ luac -l 1
main <1:0,0> (5 instructions at 0x7f9459c03d40)
0+ params, 5 slots, 1 upvalue, 4 locals, 0 constants, 0 functions
1 [1] LOADNIL 0 3
2 [2] GETTABLE 4 0 1
3 [2] GETTABLE 4 4 3
4 [2] RETURN 4 2
5 [2] RETURN 0 1
$ cat 2
local matrix,row,numberOfRows,column
return matrix[row + numberOfRows * column]
$ luac -l 2
main <2:0,0> (6 instructions at 0x7ff339c03d40)
0+ params, 5 slots, 1 upvalue, 4 locals, 0 constants, 0 functions
1 [1] LOADNIL 0 3
2 [2] MUL 4 2 3
3 [2] ADD 4 1 4
4 [2] GETTABLE 4 0 4
5 [2] RETURN 4 2
6 [2] RETURN 0 1
However, a loop like this
for row=1,numberOfRows do
for column=1,numberOfColumns do
matrix[row][column]=f(row,column)
end
end
is probably slower than this
for row=1,numberOfRows do
local r=matrix[row]
for column=1,numberOfColumns do
r[column]=f(row,column)
end
end
Again, measure both alternatives.

Speed up code to compare fields in a struct

I have a struct array Trajectories with fields uniqueDate, dateAll, and label. I want to compare the fields uniqueDate and dateAll and, where there is a correspondence, save into label a value from another struct.
I have written this code:
for k=1:nCols
for j=1:size(Trajectories(1,k).dateAll,1)
for i=1:size(Trajectories(1,k).uniqueDate,1)
if (~isempty(s(1,k).places))&&(Trajectories(1,k).dateAll(j,1)==Trajectories(1,k).uniqueDate(i,1))&&(Trajectories(1,k).dateAll(j,2)==Trajectories(1,k).uniqueDate(i,2))&&(Trajectories(1,k).dateAll(j,3)==Trajectories(1,k).uniqueDate(i,3))
for z=1:24
if(Trajectories(1,k).dateAll(j,4)==z)&&(size(s(1,k).places.all,2)>=size(Trajectories(1,k).uniqueDate,1))
Trajectories(1,k).label(j)=s(1,k).places.all(z,i);
else if(Trajectories(1,k).dateAll(j,4)==z)&&(size(s(1,k).places.all,2)<size(Trajectories(1,k).uniqueDate,1))
for l=1:size(s(1,k).places.all,2)
Trajectories(1,k).label(l)=s(1,k).places.all(z,l);
end
end
end
end
end
end
end
end
E.g
Trajectories(1,4).dateAll=[1 2004 8 1 14 1 15 0 0 0 1 42 13 2;596 2004 8 1 16 20 14 0 0 0 1 29 12 NaN;674 2004 8 1 18 26 11 0 0 0 1 20 38 1;674 2004 8 2 10 7 40 0 0 0 14 26 5 3;674 2004 8 2 11 3 29 0 0 0 1 54 3 3;631 2004 8 2 11 57 56 0 0 0 0 30 8 2;1 2004 8 2 12 4 35 0 0 0 1 53 21 2;631 2004 8 2 12 52 58 0 0 0 0 20 36 2;631 2004 8 2 13 5 3 0 0 0 1 49 40 2;631 2004 8 2 14 0 20 0 0 0 1 56 12 2;631 2004 8 2 15 2 0 0 0 0 1 57 39 2;631 2004 8 2 16 1 4 0 0 0 1 55 53 2;1 2004 8 2 17 9 15 0 0 0 1 48 41 2];
Trajectories(1,4).uniqueDate= [2004 8 1;2004 8 2;2004 8 3;2004 8 4];
It runs, but it's very, very slow. How can I modify it to speed it up?
Let's work from the inside out and see where it gets us.
Step 1: Simplify your comparison condition:
if (~isempty(s(1,k).places))&&(Trajectories(1,k).dateAll(j,1)==Trajectories(1,k).uniqueDate(i,1))&&(Trajectories(1,k).dateAll(j,2)==Trajectories(1,k).uniqueDate(i,2))&&(Trajectories(1,k).dateAll(j,3)==Trajectories(1,k).uniqueDate(i,3))
becomes
if (~isempty(s(1,k).places)) && all( Trajectories(1,k).dateAll(j,1:3)==Trajectories(1,k).uniqueDate(i,1:3) )
Then we want to remove this from a for-loop. The "intersect" function is useful here:
[ia i1 i2]=intersect(Trajectories(1,k).dateAll(:,1:3),Trajectories(1,k).uniqueDate(:,1:3),'rows');
We now have a vector i1 of all rows in dateAll that intersect with uniqueDate.
Now we can remove the loop comparing z using a similar approach:
[iz iz1 iz2] = intersect(Trajectories(1,k).dateAll(i1,4),1:24);
We have to be careful about our indices here, using a subset of a subset.
This simplifies the code to:
for k=1:nCols
if isempty(s(1,k).places)
continue; % skip to the next value of k, no need to do the rest of the comparison
end
[ia i1 i2]=intersect(Trajectories(1,k).dateAll(:,1:3),Trajectories(1,k).uniqueDate(:,1:3),'rows');
[iz iz1 iz2] = intersect(Trajectories(1,k).dateAll(i1,4),1:24);
usescalarlabel = (size(s(1,k).places.all,2)>=size(Trajectories(1,k).uniqueDate,1));
if (usescalarlabel)
Trajectories(1,k).label(i1(iz1)) = s(1,k).places.all(iz,i2(iz1));
else
% you will need to check this: I think here you were needlessly repeating this step for every match
Trajectories(1,k).label(i1(iz1)) = s(1,k).places.all(iz,:);
end
end
But wait! That z loop is exactly the same as using indexing. So we don't need that second intersect after all:
for k=1:nCols
if isempty(s(1,k).places)
continue; % skip to the next value of k, no need to do the rest of the comparison
end
[ia i1 i2]=intersect(Trajectories(1,k).dateAll(:,1:3),Trajectories(1,k).uniqueDate(:,1:3),'rows');
usescalarlabel = (size(s(1,k).places.all,2)>=size(Trajectories(1,k).uniqueDate,1));
label_indices = Trajectories(1,k).dateAll(i1,4);
if (usescalarlabel)
Trajectories(1,k).label(label_indices) = s(1,k).places.all(label_indices,i2);
else
% you will need to check this: I think here you were needlessly repeating this step for every match
Trajectories(1,k).label(label_indices) = s(1,k).places.all(label_indices,:);
end
end
You'll need to check the indexing in this - I'm sure I've made a mistake somewhere without having data to test against, but that should give you an idea on how to proceed removing the loops and using vector expressions instead. Without seeing the data that's as far as I can optimise. You may be able to go further if you can reformat your data into a set of 3d matrices / cells instead of using structs.
I am suspicious of your condition which I have called "usescalarlabel" - it seems like you are mixing two data types. Also, I would strongly recommend separating the dateAll matrices into separate "date" and "data" matrices, as columns 4 onwards don't seem to be dates. Also, the example you copy/pasted seems to have an extra value at the start of each row (column 1)? In that case you'll need to compare Trajectories(1,k).dateAll(:,2:4) instead of Trajectories(1,k).dateAll(:,1:3).
Good luck.

Dynamic Programming - Two spies at the river

I think this is a very complicated dynamic programming problem.
Two spies each have a secret number in [1..m]. To exchange the numbers, they agree to meet at the river and "innocently" take turns throwing stones: from a pile of n=26 identical stones, each spy in turn throws at least one stone into the river.
The only information is in the number of stones each spy throws on each turn. How large can m be so that they are sure they can complete the exchange?
Develop a recursive formula to count. Here is the start of the table; complete it to n=26. (You should not expect a closed form.)
n  1  2  3  4  5  6  7  8  9 10 11 12
m  1  1  1  2  2  3  4  6  8 12 16 23
Here are some hints from our professor: I suggest changing the problem to building the following table. Let R(n,m) be the range of numbers [1..R(n,m)] that A can indicate to B if they start with n stones, and both know that A also has to receive a number in [1..m] from B.
For example, if A needs no more information, R(n,1) can be computed by considering how many stones A could throw (one to n); then B throws 1 (if any remain) and A gets to decide again. The base cases are R(0,1) = R(1,1) = 1, and you can write a recursive rule if you are careful at the boundaries. (You should find the Fibonacci numbers for R(n,1).)
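A minimal sketch of that base-case column (my own illustration of the hint, not the professor's code):
## R(n,1): per the hint, the counts follow the Fibonacci recurrence
## R(n,1) = R(n-1,1) + R(n-2,1) with base cases R(0,1) = R(1,1) = 1.
R1 <- function(n) {
r <- c(1, 1) # R(0,1) and R(1,1)
for (i in seq_len(max(0, n - 1))) r <- c(r, r[i + 1] + r[i])
r[1:(n + 1)]
}
R1(9) # 1 1 2 3 5 8 13 21 34 55, the m = 1 column of the table below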
If A needs information, then B has to send it by his or her choices, so things are a little more complicated. Here is the start of the table:
n\m    1   2   3   4   5
0      1   0   0   0   0
1      1   0   0   0   0
2      2   0   0   0   0
3      3   1   0   0   0
4      5   2   1   0   0
5      8   4   2   1   1
6     13   7   4   3   2
7     21  12   8   6   4
8     34  20  15  11   8
9     55  33  27  19  16
From the R(n,m) table, how would you recover the entries of the earlier table (the table showing m as a function of n)?
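One hedged reading of that last question (my own inference, not from the problem set): the exchange succeeds for a given m exactly when R(n,m) >= m, so the earlier table's entry should be the largest such m. A quick check against the columns shown above:
## rows are n = 0..9, columns are m = 1..5, copied from the table above
R <- rbind(c(1,0,0,0,0), c(1,0,0,0,0), c(2,0,0,0,0), c(3,1,0,0,0),
c(5,2,1,0,0), c(8,4,2,1,1), c(13,7,4,3,2), c(21,12,8,6,4),
c(34,20,15,11,8), c(55,33,27,19,16))
m_of_n <- apply(R, 1, function(row) max(0, which(row >= seq_along(row))))
m_of_n[2:8] # 1 1 1 2 2 3 4 for n = 1..7, matching the first table;
## n = 8 and 9 would need columns beyond m = 5 to come out as 6 and 8.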

Cumulative Maxima as Indicated by X in APL

The third item in the FinnAPL Library is called "Cumulative maxima (⌈) of subvectors of Y indicated by X", where X is a binary vector and Y is a vector of numbers. Here's an example of its usage:
X←1 0 0 0 1 0 0 0
Y←9 78 3 2 50 7 69 22
Y[A⍳⌈\A←⍋A[⍋(+\X)[A←⍋Y]]] ⍝ output 9 78 78 78 50 50 69 69
You can see that, beginning from either the start of the vector or from any 1 value in the X array, the cumulative maximum is found for all corresponding elements of Y until another 1 is found in X. In the example given, X is dividing the array into two equal parts of 4 numbers each. In the first part, 9 is the maximum until 78 is encountered, and in the second part 50 is the maximum until 69 is encountered.
That's easy enough to understand, and I could blindly use it as is, but I'd like to understand how it works, because APL idioms are essentially algorithms made up of operators and functions. To understand APL well, it's important to understand how the masters were able to weave it all together into such compact and elegant lines of code.
I find this particular idiom especially hard to understand because of the indexing nested two layers deep. So my question is, what makes this idiom tick?
This idiom can be broken down into smaller idioms, and most importantly, it contains idiom #11 from the FinnAPL Library entitled:
Grade up (⍋) for sorting subvectors of Y indicated by X
Using the same values for X and Y given in the question, here's an example of its usage:
X←1 0 0 0 1 0 0 0
Y←9 78 3 2 50 7 69 22
A[⍋(+\X)[A←⍋Y]] ⍝ output 4 3 1 2 6 8 5 7
As before, X is dividing the vector into two halves, and the output indicates, for each position, which element of Y is needed to sort each of the halves. So, the 4 in the output is saying that it needs the 4th element of Y (2) in the 1st position; the 3 indicates the 3rd element (3) in the 2nd position; the 1 indicates the 1st element (9) in the 3rd position; etc. Thus, if we apply this indexing to Y, we get:
Y[A[⍋(+\X)[A←⍋Y]]] ⍝ output 2 3 9 78 7 22 50 69
In order to understand the indexing within this grade-up idiom, consider what is happening with the following:
(+\X)[A←⍋Y] ⍝ Sorted Cumulative Addition
Breaking it down step by step:
A←⍋Y ⍝ 4 3 6 1 8 5 7 2
+\X ⍝ 1 1 1 1 2 2 2 2
(+\X)[A←⍋Y] ⍝ 1 1 2 1 2 2 2 1 SCA
A[⍋(+\X)[A←⍋Y]] ⍝ 4 3 1 2 6 8 5 7
You can see that the sorted cumulative addition (SCA) of X, 1 1 2 1 2 2 2 1, applied to A acts as a combination of compress left and compress right. All values of A that line up with a 1 are moved to the left, and those lining up with a 2 move to the right. Of course, if X had more 1s, it would compress and place the compressed packets in the order indicated by the values of the SCA result. For example, if the SCA of X were 3 3 2 1 2 2 1 1 1, you would end up with the four values corresponding to the 1s, followed by the three values corresponding to the 2s, and finally the two values corresponding to the 3s.
You may have noticed that I skipped the step that would show the effect of grade up ⍋:
(+\X)[A←⍋Y] ⍝ 1 1 2 1 2 2 2 1 SCA
⍋(+\X)[A←⍋Y] ⍝ 1 2 4 8 3 5 6 7 Grade up
A[⍋(+\X)[A←⍋Y]] ⍝ 4 3 1 2 6 8 5 7
The effect of compression and rearrangement isn't accomplished by SCA alone. It effectively acts as a rank, as I discussed in another post. Also in that post, I talked about how rank and index are essentially two sides of the same coin, and you can use grade up to switch between the two. Therefore, that is what is happening here: SCA is converted to an index to apply to A, and the effect is grade-up sorted subvectors as indicated by X.
From Sorted Subvectors to Cumulative Maxima
As already described, the result of sorting the subvectors is an index, which when applied to Y, compresses the data into packets and arranges those packets according to X. The point is that it is an index, and once again, grade up is applied, which converts indexes into ranks:
⍋A[⍋(+\X)[A←⍋Y]] ⍝ 3 4 2 1 7 5 8 6
The question here is: why? Well, the next step is applying a cumulative maxima, and that really only makes sense if it is applied to rank values, which represent relative magnitude within each packet. Looking at the values, you can see that 4 is the maximum for the first group of 4, and 8 is for the second group. Those values correspond to the input values of 78 and 69, which is what we want. It doesn't make sense (at least in this case) to apply a maxima to index values, which represent position, so the conversion to rank is necessary. Applying the cumulative maxima gives:
⌈\A←⍋A[⍋(+\X)[A←⍋Y]] ⍝ 3 4 4 4 7 7 8 8
That leaves one last step to finish the index. After the cumulative maxima operation, the vector values still represent ranks, so they need to be converted back to index values. To do that, the index-of function is used. It takes the values in the right argument and returns their positions as found in the left argument:
A⍳⌈\A←⍋A[⍋(+\X)[A←⍋Y]] ⍝ 1 2 2 2 5 5 7 7
To make it easier to see:
3 4 2 1 7 5 8 6 left argument
3 4 4 4 7 7 8 8 right argument
1 2 2 2 5 5 7 7 result
The 4 is in the 2nd position in the left argument, so the result shows a 2 for every 4 in the right argument. The index is complete, so applying it to Y, we get the expected result:
Y[A⍳⌈\A←⍋A[⍋(+\X)[A←⍋Y]]] ⍝ 9 78 78 78 50 50 69 69
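As a cross-check, here is a line-by-line translation of the idiom into R (my own sketch: ⍋ corresponds to order, ⌈\ to cummax, and dyadic ⍳ to match); every intermediate value matches the vectors shown above:
X <- c(1,0,0,0,1,0,0,0)
Y <- c(9,78,3,2,50,7,69,22)
A <- order(Y)               # grade up of Y: 4 3 6 1 8 5 7 2
S <- A[order(cumsum(X)[A])] # sorted subvectors: 4 3 1 2 6 8 5 7
R <- order(S)               # indexes converted to ranks: 3 4 2 1 7 5 8 6
Y[match(cummax(R), R)]      # 9 78 78 78 50 50 69 69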
My implementation:
X←1 0 0 0 1 0 0 0
Y←9 78 3 2 50 7 69 22
¯1+X/⍳⍴X ⍝ position
0 4
(,¨¯1+X/⍳⍴X)↓¨⊂Y
9 78 3 2 50 7 69 22 50 7 69 22
(1↓(X,1)/⍳⍴X,1)-X/⍳⍴X ⍝ length
4 4
(,¨(1↓(X,1)/⍳⍴X,1)-X/⍳⍴X)↑¨(,¨¯1+X/⍳⍴X)↓¨⊂Y
9 78 3 2 50 7 69 22
⌈\¨(,¨(1↓(X,1)/⍳⍴X,1)-X/⍳⍴X)↑¨(,¨¯1+X/⍳⍴X)↓¨⊂Y
9 78 78 78 50 50 69 69
∊⌈\¨(,¨(1↓(X,1)/⍳⍴X,1)-X/⍳⍴X)↑¨(,¨¯1+X/⍳⍴X)↓¨⊂Y
9 78 78 78 50 50 69 69
Have a nice day.

Evaluating the distribution of words in a grid

I'm creating a word search and am trying to measure the quality of the generated puzzles by checking that the word set is "distributed evenly" throughout the grid. For example, placing the words consecutively, filling the grid row by row, is not particularly interesting, because there will be clusters and the user will quickly notice a pattern.
How can I measure how 'evenly distributed' the words are?
What I'd like to do is write a program that takes in a word search as input and output a score that evaluates the 'quality' of the puzzle. I'm wondering if anyone has seen a similar problem and could refer me to some resources. Perhaps there is some concept in statistics that might help? Thanks.
The basic problem is the distribution of lines in a square or rectangle. You can either do this geometrically or using integer arrays. I will try the integer arrays here.
Let M be a matrix of your puzzle,
A B C D
E F G H
I J K L
M N O P
Let the word "EFGH" be an existent word, as well as "CGKO". Then, create a matrix which will contain the count of membership in eighter words in each cell:
0 0 1 0
1 1 2 1
0 0 1 0
0 0 1 0
Apply a rule: each cell's new value is the sum of its (4-way) neighbours, multiplied by the cell's original value if that original value is 2 or higher.
0 0 1 0 1 2 2 2
1 1 2 1 -\ 1 3 8 2
0 0 1 0 -/ 1 2 3 2
0 0 1 0 0 1 1 1
And sum up the values in each row and each column of the matrix:
1 2 2 2 = 7
1 3 8 2 = 14
1 2 3 2 = 8
0 1 1 1 = 3
| | | |
3 8 14 7
Then calculate the average of both result sets:
(7 + 14 + 8 + 3) / 4 = 32 / 4 = 8
(3 + 8 + 14 + 7) / 4 = 32 / 4 = 8
And calculate the average difference from that average within each result set:
row sums:      column sums:
7 <-> 8 = 1    3 <-> 8 = 5
14 <-> 8 = 6   8 <-> 8 = 0
8 <-> 8 = 0    14 <-> 8 = 6
3 <-> 8 = 5    7 <-> 8 = 1
___avg         ___avg
3              3
And multiply them together:
3 * 3 = 9
This product you treat as the distribution score. You might need to tweak it a little to make it work better, but it should calculate distribution scores quite nicely.
Here is an example of a bad distribution:
1 0 0 0      1 1 0 0
1 0 0 0  -\  2 1 0 0
1 0 0 0  -/  2 1 0 0
1 0 0 0      1 1 0 0
row sums: 2 3 3 2 (avg 2.5, avg difference 0.5)
column sums: 6 4 0 0 (avg 2.5, avg difference 2.5)
score: 0.5 * 2.5 = 1.25
Edit: calc. errors fixed.
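For completeness, here is the whole procedure as code (a sketch of my reading of the rule above; distribution_score and the variable names are mine):
## Score a membership-count matrix m: turn each cell into the sum of its
## 4-way neighbours (times the original value when that value >= 2), then
## multiply the average absolute deviations of the row and column sums.
distribution_score <- function(m) {
n <- nrow(m); p <- ncol(m)
out <- matrix(0, n, p)
for (i in seq_len(n)) for (j in seq_len(p)) {
s <- 0
if (i > 1) s <- s + m[i - 1, j]
if (i < n) s <- s + m[i + 1, j]
if (j > 1) s <- s + m[i, j - 1]
if (j < p) s <- s + m[i, j + 1]
out[i, j] <- if (m[i, j] >= 2) s * m[i, j] else s
}
rs <- rowSums(out); cs <- colSums(out)
mean(abs(rs - mean(rs))) * mean(abs(cs - mean(cs)))
}
good <- rbind(c(0,0,1,0), c(1,1,2,1), c(0,0,1,0), c(0,0,1,0))
bad <- cbind(c(1,1,1,1), 0, 0, 0)
distribution_score(good) # 9
distribution_score(bad)  # 1.25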
