Calculate within correlation in panel data in long form

We have a simple panel data set in long form, which has the following structure:
i  t         x
1  Aug-2011   282
2  Aug-2011  -220
1  Sep-2011   334
2  Sep-2011   126
1  Sep-2012  -573
2  Sep-2012   305
1  Nov-2013   335
2  Nov-2013   205
3  Nov-2013   485
I would like to get the cross-correlation between each i across the time variable t.
This would be possible by converting the data to wide format. Unfortunately, that approach is not feasible due to the large number of i and t values in the real data set.
Is it possible to do something like the following fictional command:
by (tabulate t): corr x

You can easily calculate the correlations of a single variable such as x across panel groups using the reshape option of the community-contributed command pwcorrf:
ssc install pwcorrf
For illustration, consider (a slightly simplified version of) your toy example:
clear
input i t x
1 2011 282
2 2011 -220
1 2012 334
2 2012 126
1 2013 -573
2 2013 305
1 2014 335
2 2014 205
3 2014 485
end
xtset i t
panel variable: i (unbalanced)
time variable: t, 2011 to 2014
delta: 1 unit
pwcorrf x, reshape
Variable(s): x
Panel var: i
corrMatrix[3,3]
            1           2           3
1           1           0           0
2  -.54223207           1           0
3           .           .           1
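
As a cross-check, the wide-format route mentioned in the question reproduces the same numbers on this toy example (it is only infeasible at scale). pwcorr is used rather than correlate, because correlate would drop every row in which any panel is missing:
preserve
reshape wide x, i(t) j(i)
pwcorr x1 x2 x3
restore
The x1-x2 entry of the resulting table should match the -.54223207 reported by pwcorrf above.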

Related

How can you improve computation time when predicting with KNN imputation?

I feel like my run time is extremely slow for my data set; this is the code:
library(caret)
library(data.table)
knnImputeValues <- preProcess(mainData[trainingRows, imputeColumns], method = c("zv", "knnImpute"))
knnTransformed <- predict(knnImputeValues, mainData[ 1:1000, imputeColumns])
The preProcess call that creates knnImputeValues runs fairly quickly; however, the predict function takes a tremendous amount of time. When I calculated it on a subset of the data, this was the result:
testtime <- system.time(knnTransformed <- predict(knnImputeValues, mainData[1:15000, imputeColumns]))
testtime
   user  system elapsed
 969.78   38.70 1010.72
Additionally, it should be noted that caret's preProcess uses "RANN".
Now my full dataset is:
str(mainData[ , imputeColumns])
'data.frame': 1809032 obs. of 16 variables:
$ V1: int 3 5 5 4 4 4 3 4 3 3 ...
$ V2: Factor w/ 3 levels "1000000","1500000",..: 1 1 3 1 1 1 1 3 1 1 ...
$ V3: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ V4: int 2 5 5 12 4 5 11 8 7 8 ...
$ V5: int 2 0 0 2 0 0 1 3 2 8 ...
$ V6: int 648 489 489 472 472 472 497 642 696 696 ...
$ V7: Factor w/ 4 levels "","N","U","Y": 4 1 1 1 1 1 1 1 1 1 ...
$ V8: int 0 0 0 0 0 0 0 1 1 1 ...
$ V9: num 0 0 0 0 0 ...
$ V10: Factor w/ 56 levels "1","2","3","4",..: 45 19 19 19 19 19 19 46 46 46 ...
$ V11: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ V12: num 2 5 5 12 4 5 11 8 7 8 ...
$ V13: num 2 0 0 2 0 0 1 3 2 8 ...
$ V14: Factor w/ 4 levels "1","2","3","4": 2 2 2 2 2 2 2 2 3 3 ...
$ V15: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 2 ...
$ V16: num 657 756 756 756 756 ...
So is there something I'm doing wrong, or is this typical of how long it will take to run? If you extrapolate back of the envelope (which I know isn't entirely accurate), you'd get what, 33 days?
Also, it looks like system time is very low and user time is very high; is that normal?
My computer is a laptop with an Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz processor.
Additionally, would this improve the runtime of the predict function?
cl <- makeCluster(4)
registerDoParallel()
I tried it, and it didn't seem to make a difference other than that all the processors looked more active in my task manager.
FOCUSED QUESTION: I'm using the caret package to do KNN imputation on 1.8 million rows. The way I'm currently doing it will take over a month to run; how do I write this in such a way that I could do it much faster (if possible)?
Thank you for any help provided. The answer might very well be "that's how long it takes, don't bother"; I just want to rule out any possible mistakes.
You can speed this up via the imputation package and the use of canopies, which can be installed from GitHub:
Sys.setenv("PKG_CXXFLAGS"="-std=c++0x")
devtools::install_github("alexwhitworth/imputation")
Canopies use a cheap distance metric, in this case distance from the data mean vector, to get approximate neighbors. In general, we want to keep each canopy under 100k rows, so for 1.8M rows we'll use 20 canopies:
library("imputation")
to_impute <- mainData[trainingRows, imputeColumns] ## OP undefined
imputed <- kNN_impute(to_impute, k= 10, q= 2, verbose= TRUE,
parallel= TRUE, n_canopies= 20)
NOTE:
The imputation package requires numeric data inputs. You have several factor variables in your str output. They will cause this to fail.
You'll also get some mean vector imputation if you have fully missing rows.
# note this example data is too small for canopies to be useful
# meant solely to illustrate
set.seed(2143L)
x1 <- matrix(rnorm(1000), 100, 10)
x1[sample(1:1000, size= 50, replace= FALSE)] <- NA
x_imp <- kNN_impute(x1, k=5, q=2, n_canopies= 10)
sum(is.na(x_imp[[1]])) # 0
# with fully missing rows
x2 <- x1; x2[5,] <- NA
x_imp <- kNN_impute(x2, k=5, q=2, n_canopies= 10)
[1] "Computing canopies kNN solution provided within canopies"
[1] "Canopies complete... calculating kNN."
row(s) 1 are entirely missing.
These row(s)' values will be imputed to column means.
Warning message:
In FUN(X[[i]], ...) :
Rows with entirely missing values imputed to column means.
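
One practical wrinkle with the above: as noted, kNN_impute needs numeric input, while mainData has several factor columns. A minimal sketch of one way to encode them first (whether integer codes are a sensible distance basis for your factors is a modelling decision, not a given):
library("imputation")
to_impute <- mainData[trainingRows, imputeColumns]  ## OP undefined
## encode factor columns as integer codes so the input is all-numeric
factor_cols <- vapply(to_impute, is.factor, logical(1))
to_impute[factor_cols] <- lapply(to_impute[factor_cols], as.integer)
imputed <- kNN_impute(as.matrix(to_impute), k= 10, q= 2, verbose= TRUE,
                      parallel= TRUE, n_canopies= 20)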

Speed up code to compare fields in a struct

I have the struct Trajectories with fields uniqueDate, dateAll, and label. I want to compare the fields uniqueDate and dateAll and, if there is a correspondence, save in label a value from another struct.
I have written this code:
for k=1:nCols
    for j=1:size(Trajectories(1,k).dateAll,1)
        for i=1:size(Trajectories(1,k).uniqueDate,1)
            if (~isempty(s(1,k).places)) && (Trajectories(1,k).dateAll(j,1)==Trajectories(1,k).uniqueDate(i,1)) && (Trajectories(1,k).dateAll(j,2)==Trajectories(1,k).uniqueDate(i,2)) && (Trajectories(1,k).dateAll(j,3)==Trajectories(1,k).uniqueDate(i,3))
                for z=1:24
                    if (Trajectories(1,k).dateAll(j,4)==z) && (size(s(1,k).places.all,2)>=size(Trajectories(1,k).uniqueDate,1))
                        Trajectories(1,k).label(j)=s(1,k).places.all(z,i);
                    elseif (Trajectories(1,k).dateAll(j,4)==z) && (size(s(1,k).places.all,2)<size(Trajectories(1,k).uniqueDate,1))
                        for l=1:size(s(1,k).places.all,2)
                            Trajectories(1,k).label(l)=s(1,k).places.all(z,l);
                        end
                    end
                end
            end
        end
    end
end
For example:
Trajectories(1,4).dateAll = [ ...
    1   2004 8 1 14 1  15 0 0 0 1  42 13 2;
    596 2004 8 1 16 20 14 0 0 0 1  29 12 NaN;
    674 2004 8 1 18 26 11 0 0 0 1  20 38 1;
    674 2004 8 2 10 7  40 0 0 0 14 26 5  3;
    674 2004 8 2 11 3  29 0 0 0 1  54 3  3;
    631 2004 8 2 11 57 56 0 0 0 0  30 8  2;
    1   2004 8 2 12 4  35 0 0 0 1  53 21 2;
    631 2004 8 2 12 52 58 0 0 0 0  20 36 2;
    631 2004 8 2 13 5  3  0 0 0 1  49 40 2;
    631 2004 8 2 14 0  20 0 0 0 1  56 12 2;
    631 2004 8 2 15 2  0  0 0 0 1  57 39 2;
    631 2004 8 2 16 1  4  0 0 0 1  55 53 2;
    1   2004 8 2 17 9  15 0 0 0 1  48 41 2];
Trajectories(1,4).uniqueDate= [2004 8 1;2004 8 2;2004 8 3;2004 8 4];
It runs, but it's very, very slow. How can I modify it to speed it up?
Let's work from the inside out and see where it gets us.
Step 1: Simplify your comparison condition:
if (~isempty(s(1,k).places))&&(Trajectories(1,k).dateAll(j,1)==Trajectories(1,k).uniqueDate(i,1))&&(Trajectories(1,k).dateAll(j,2)==Trajectories(1,k).uniqueDate(i,2))&&(Trajectories(1,k).dateAll(j,3)==Trajectories(1,k).uniqueDate(i,3))
becomes
if (~isempty(s(1,k).places)) && all( Trajectories(1,k).dateAll(j,1:3)==Trajectories(1,k).uniqueDate(i,1:3) )
Then we want to remove this from a for-loop. The "intersect" function is useful here:
[ia i1 i2]=intersect(Trajectories(1,k).dateAll(:,1:3),Trajectories(1,k).uniqueDate(:,1:3),'rows');
We now have a vector i1 of all rows in dateAll that intersect with uniqueDate.
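If intersect with the 'rows' option is unfamiliar, here is a tiny standalone illustration (made-up values):
A = [2004 8 1; 2004 8 2; 2004 8 2];
B = [2004 8 2; 2004 8 3];
[rows, iA, iB] = intersect(A, B, 'rows')
% rows = [2004 8 2], iA = 2, iB = 1: row 2 of A equals row 1 of B.
% Note intersect returns unique rows and the first matching index only.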
Now we can remove the loop comparing z using a similar approach:
[iz iz1 iz2] = intersect(Trajectories(1,k).dateAll(i1,4),1:24);
We have to be careful about our indices here, using a subset of a subset.
This simplifies the code to:
for k=1:nCols
    if isempty(s(1,k).places)
        continue; % skip to the next value of k, no need to do the rest of the comparison
    end
    [ia, i1, i2] = intersect(Trajectories(1,k).dateAll(:,1:3), Trajectories(1,k).uniqueDate(:,1:3), 'rows');
    [iz, iz1, iz2] = intersect(Trajectories(1,k).dateAll(i1,4), 1:24);
    usescalarlabel = (size(s(1,k).places.all,2) >= size(Trajectories(1,k).uniqueDate,1));
    if (usescalarlabel)
        Trajectories(1,k).label(i1(iz1)) = s(1,k).places.all(iz,i2(iz1));
    else
        % you will need to check this: I think here you were needlessly repeating this step for every match
        Trajectories(1,k).label(i1(iz1)) = s(1,k).places.all(iz,:);
    end
end
But wait! That z loop is exactly the same as using indexing. So we don't need that second intersect after all:
for k=1:nCols
    if isempty(s(1,k).places)
        continue; % skip to the next value of k, no need to do the rest of the comparison
    end
    [ia, i1, i2] = intersect(Trajectories(1,k).dateAll(:,1:3), Trajectories(1,k).uniqueDate(:,1:3), 'rows');
    usescalarlabel = (size(s(1,k).places.all,2) >= size(Trajectories(1,k).uniqueDate,1));
    label_indices = Trajectories(1,k).dateAll(i1,4);
    if (usescalarlabel)
        Trajectories(1,k).label(label_indices) = s(1,k).places.all(label_indices,i2);
    else
        % you will need to check this: I think here you were needlessly repeating this step for every match
        Trajectories(1,k).label(label_indices) = s(1,k).places.all(label_indices,:);
    end
end
You'll need to check the indexing in this - I'm sure I've made a mistake somewhere without having data to test against, but that should give you an idea of how to proceed: remove the loops and use vector expressions instead. Without seeing the data, that's as far as I can optimise. You may be able to go further if you can reformat your data into a set of 3d matrices / cells instead of using structs.
I am suspicious of your condition which I have called "usescalarlabel" - it seems like you are mixing two data types. Also, I would strongly recommend splitting the dateAll matrix into separate "date" and "data" matrices, as columns 4 onwards don't seem to be dates. Also, the example you copy/pasted in seems to have an extra value at column index 1? In that case you'll need to compare Trajectories(1,k).dateAll(:,2:4) instead of Trajectories(1,k).dateAll(:,1:3).
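As a sketch of that restructuring suggestion (the new field names date and data are hypothetical):
T = Trajectories(1,4);
T.date = T.dateAll(:, 2:4);    % year, month, day (column 1 looks like an extra ID value)
T.data = T.dateAll(:, 5:end);  % time of day and the remaining values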
Good luck.

Obtain a different result when evaluating Stanford NLP sentiment

I downloaded Stanford NLP 3.5.2 and ran sentiment analysis with the default configuration (i.e. I did not change anything, just unzipped and ran).
java -cp "*" edu.stanford.nlp.sentiment.Evaluate -model edu/stanford/nlp/models/sentiment/sentiment.ser.gz -treebank test.txt
EVALUATION SUMMARY
Tested 82600 labels
66258 correct
16342 incorrect
0.802155 accuracy
Tested 2210 roots
976 correct
1234 incorrect
0.441629 accuracy
Label confusion matrix
Guess/Gold      0     1      2      3     4   Marg. (Guess)
0             323   161     27      3     3             517
1            1294  5498   2245    652   148            9837
2             292  2993  51972   2868   282           58407
3              99   602   2283   7247  2140           12371
4               0     1     21    228  1218            1468
Marg. (Gold) 2008  9255  56548  10998  3791
0 prec=0.62476, recall=0.16086, spec=0.99759, f1=0.25584
1 prec=0.55891, recall=0.59406, spec=0.94084, f1=0.57595
2 prec=0.88982, recall=0.91908, spec=0.75299, f1=0.90421
3 prec=0.58581, recall=0.65894, spec=0.92844, f1=0.62022
4 prec=0.8297, recall=0.32129, spec=0.99683, f1=0.46321
Root label confusion matrix
Guess/Gold     0    1    2    3    4   Marg. (Guess)
0             44   39    9    0    0              92
1            193  451  190  131   36            1001
2             23   62   82   30    8             205
3             19   81  101  299  255             755
4              0    0    7   50  100             157
Marg. (Gold) 279  633  389  510  399
0 prec=0.47826, recall=0.15771, spec=0.97514, f1=0.2372
1 prec=0.45055, recall=0.71248, spec=0.65124, f1=0.55202
2 prec=0.4, recall=0.2108, spec=0.93245, f1=0.27609
3 prec=0.39603, recall=0.58627, spec=0.73176, f1=0.47273
4 prec=0.63694, recall=0.25063, spec=0.96853, f1=0.35971
Approximate Negative label accuracy: 0.646009
Approximate Positive label accuracy: 0.732504
Combined approximate label accuracy: 0.695110
Approximate Negative root label accuracy: 0.797149
Approximate Positive root label accuracy: 0.774477
Combined approximate root label accuracy: 0.785832
The test.txt file is downloaded from http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip (which contains train.txt, dev.txt and test.txt). The download link comes from http://nlp.stanford.edu/sentiment/code.html
However, in the paper the sentiment analysis tool is based on, "Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y. and Potts, C., 2013, October. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the conference on empirical methods in natural language processing (EMNLP) (pp. 1631-1642)", the authors report an accuracy of 0.807 when classifying five classes.
Are the results I obtained normal?
I get the same results when I run it out of the box. It would not surprise me if the version of their system they made for Stanford CoreNLP differs slightly from the version in the paper.

Stata: overwrite all observations in a cross section except the last 20 non-NA

I have a large, highly unbalanced panel in Stata, where each cross section has only a few observations and the rest are NA (.).
I want to overwrite all non-NA observations that are not the last 20 non-NA observations in each cross section. I'm not sure how to correctly specify the range, but you can see my thoughts below. There are gaps between the observations.
Thanks
*Edit
I removed the code as it created uncertainty. It was included to show what I had tried.
My cross section dimension identifier is xsection
My time dimension identifier is id01
*Edit
I have created an example below. The code needs to extract the last 3 non-NA (.) values of each cross section in variable x and enter these into a new variable z. Alternatively, all observations in x should be set to . except the last 3 (with allowed gaps). It does not matter whether a new variable z is created or the observations in x are replaced so that it looks like z.
id01  xsection   x   z
2005         1  20   .
2006         1  21   .
2007         1  22   .
2008         1  23  23
2009         1  37  37
2010         1  38  38
2011         1   .   .
2012         1   .   .
2005         2  24   .
2006         2  25   .
2007         2  21   .
2008         2  27  27
2009         2  33  33
2010         2   .   .
2011         2  37  37
2012         2   .   .
Note that NA is the jargon of some other programs, but not native to Stata. Stata calls these "missing values".
If you just (1) segregate the observations with missing values, then (2) identifying the last so-many observations with non-missing values follows immediately from sorting within the remaining observations, those with non-missing values.
. clear
. input id01 xsection x z
id01 xsection x z
1. 2005 1 20 .
2. 2006 1 21 .
3. 2007 1 22 .
4. 2008 1 23 23
5. 2009 1 37 37
6. 2010 1 38 38
7. 2011 1 . .
8. 2012 1 . .
9. 2005 2 24 .
10. 2006 2 25 .
11. 2007 2 21 .
12. 2008 2 27 27
13. 2009 2 33 33
14. 2010 2 . .
15. 2011 2 37 37
16. 2012 2 . .
17. end
. gen ismiss = missing(x)
. bysort ismiss xsection (id01) : gen z_last = x if _N - _n < 3
(10 missing values generated)
. sort id01 xsection
. assert z_last == z
Here z was supplied as what was wanted and z_last is calculated and shown to be equivalent.
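Scaled up to the original request, the same idea keeps only the last 20 non-missing values of x in each cross section and overwrites the rest (a sketch, untested against the full data):
gen ismiss = missing(x)
bysort ismiss xsection (id01) : replace x = . if !ismiss & _N - _n >= 20
sort xsection id01
drop ismiss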
This answer is a bit clunky, but it should get the job done. If x is the variable in which you want to replace values with missing:
bysort xsection (id01): gen counter = sum(!missing(x))
by xsection: gen maxCount = counter[_N]
gen dropVar = maxCount - counter
replace x = . if dropVar >= 20 & !missing(x)
Counting with sum(!missing(x)) rather than _n makes the counter skip the gaps, so only non-missing observations count towards the last 20. I am fairly sure that the equal sign should be included, but this would be easy to check.

How can I define a verb in J that applies a different verb alternately to each atom in a list?

Imagine I've defined the following name in J:
m =: >: i. 2 4 5
This looks like the following:
 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
16 17 18 19 20

21 22 23 24 25
26 27 28 29 30
31 32 33 34 35
36 37 38 39 40
I want to create a monadic verb of rank 1 that applies to each list in this list of lists. It will double (+:) or add 1 (>:) to each alternate item in the list. If we were to apply this verb to the first row, we'd get 2 3 6 5 10.
It's fairly easy to get a list of booleans which alternate with each item, e.g., 0 1 $~ {:$ m gives us 0 1 0 1 0. I thought, aha! I'll use something like +:`>:@. followed by some expression, but I could never quite get it to work.
Any suggestions?
UPDATE
The following appears to work, but perhaps it can be refactored into something more elegant by a J pro.
poop =: monad define
(($ y) $ 0 1 $~ {:$ y) ((]+:)`(]>:)@.[)"0 y
)
I would use the oblique verb, with rank 1 (/."1), so it applies to successive elements of each list in turn.
You can pass a gerund into /. and it applies them in order, extending cyclically.
   +:`>: /."1 m
 2  3  6  5 10
12  8 16 10 20
22 13 26 15 30
32 18 36 20 40

42 23 46 25 50
52 28 56 30 60
62 33 66 35 70
72 38 76 40 80
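The cyclic extension is not limited to two verbs. A quick sketch on a plain vector, cycling double, increment, and negate:
   +:`>:`-/. 1 2 3 4 5 6
2 3 _3 8 6 _6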
I spent a long time looking at it, and I believe I know why ,@: works to recover the shape of the argument.
The shape of the arguments to the parenthesized phrase is the shape of the argument passed to it on the right, even though the rank is altered by the " conjunction (well, that is what trace called it; I thought it was an adverb). If , were monadic, it would be a ravel, and the result would be a vector, or at least of a lower rank than the input, depending on the adverbs applied to ravel. That is what happens if you take the conjunction out: you get a vector.
So what I believe is happening is that the conjunction is making , act like a dyadic ,, which is called append. Append conforms what is being appended to the thing it is appended to. Here it is appending to nothing, but that thing still has a shape, and so it ends up altering the intermediate vector back to the shape of the input.
Now I'm probably wrong. But $ ,"0@:(+:`>:/.)"1 >: i. 2 4 5 gives 2 4 5 1 1, which I thought sort of proved my case.
(,@:(+:`>:/.)"1 a) works, but note that ((* 2 1 $~ $)@:(+ 0 1 $~ $)"1 a) would also have worked (and is about 20 times faster, on large arrays, in my brief tests).
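
A quick way to see the shape behaviour being discussed, as a sketch:
   $ +:`>:/."1 m      NB. rank 1 reassembles the row results: shape preserved
2 4 5
   $ , +:`>:/."1 m    NB. a bare ravel flattens the result to a vector
40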
