Stata -- extract regression output for 3500 regressions run in a loop - for-loop

I am using a forval loop to run 3,500 regressions, one for each group. I then need to summarize the results. Typically, when I use loops to run regressions I use the estimates store function followed by estout. Below is a sample code. But I believe there is a limit of 300 that this code can handle. I would very much appreciate if someone could let me know how to automate the process for 3,500 regressions.
Sample code:
forval j = 1/3500 {
regress y x if group == `j'
estimates store m`j', title(Model `j')
}
estout m* using "Results.csv", cells(b t) ///
legend label varlabels(_cons constant) ///
stats(r2 df_r N, fmt(3 0 1) label(R-sqr dfres N)) replace

Here's an example using statsby where I run a regression of price on mpg for each of the 5 groups defined by the rep78 variable and store the results in Stata dataset called my_regs:
sysuse auto, clear
statsby _b _se, by(rep78) saving(my_regs): reg price mpg
use my_regs.dta
If you prefer, you can omit the saving() option and then your dataset will be replaced in memory by the regression results, so you won't need to open the file directly with use.

Related

SAS: What's the optimal way to find the sum of a column by another column?

I want to find out the best way to perform a group-by in SAS so I can perform some benchmarks. The simplest two ways I can think of is Proc SQL and Proc means. Here is the example in proc sql
proc sql noprint; /* took 6 mins */
create table summ as select
id,
sum(val)
from
randint
group by
id
;
quit;
I think there are ways to make this run fast
use sasfile command to load the data into memory first
create an index on id
Are there any other options I can use? Any SAS options I should turn on to make this run as fast as possible? I am not tied to proc sql nor proc means, so if there are faster ways then I would love to know about it!!!
My set up code is as below
options macrogen;
options obs=max sortsize=max source2 FULLSTIMER;
options minoperator SASTRACE=',,,d' SASTRACELOC=SASLOG;
options compress = binary NOSTSUFFIX;
options noxwait noxsync;
options LRECL=32767;
proc fcmp outlib=work.myfunc.sample;
function RandBetween(min, max);
return (min + floor((1 + max - min) * rand("uniform")));
endsub;
run;
options cmplib=work.myfunc;
data RandInt;
do i = 1 to 250000000;
id = RandBetween(1, 2500000);
val = rand("uniform");
output;
end;
drop i;
run;
My SAS comparison macros are as below
%macro sasbench(dosql = N); %macro _; %mend;
%if &dosql. = Y %then %do;
proc sql noprint; /* took 6 mins */
create table summ as select
id,
sum(val)
from
randint
group by
id
;
quit;
%end;
proc means data=randint sum noprint;
var val ;
class id;
output out = summmeans(drop=_type_ _freq_) sum = /autoname;
run;
%mend;
%sasbench();
/**/
/*sasfile randint load;*/
/*%sasbench();*/
/*sasfile randint close;*/
proc datasets lib=work;
modify randint;
INDEX CREATE id / nomiss;
run;
%sasbench();
sasfile is only a benefit if the entire data set can fit into session ram limits and if the data set is going to be used more than once. I suppose this would make sense if your benchmark includes multiple runs / different techniques on the same sasfile.
An index on id would help if the data was unsorted by id. When the data set is presorted by id the id column metadata will have sortedby flag set which a procedure can use for its own internal optimization, however there is no guarantee. As for indexes, use option msglevel=i to get informational messages in the log about index selection during processing.
The fastest way is direct addressing, but requires enough ram to handle the largest id value as an array index:
array ids(250000000) _temporary_
ids(id) + value
The next fastest way is probably hand coded array based hashing:
search SAS conference proceedings for papers by Paul Dorfman
The next fastest hash way is probably the hash component object with key suminc.
DATA Step was edited to align with the comments
data demo_data;
do rownum = 1 to 1000;
id = ceil(100*ranuni(123)); * NOTE: 100 different groups, disordered;
value = ceil(1000*ranuni(123)); * NOTE: want to sum value over group, for demonstration individual values integers from 1..1000;
output;
end;
run;
data _null_;
if 0 then set demo_data(keep=id value); %* prep pdv ;
length total 8; %* prep keysum variable ;
call missing (total); %* prevent warnings ;
declare hash ids (ordered:'a', suminc:'value', keysum:'total'); %* ordered ensures keys will be sorted ascending upon output ;
ids.defineKey('id');
*ids.defineData('id'); % * not having a defineData is an implicit way of adding only the keys as data, only data + keysum variables are .output;
ids.defineDone();
* read all records and touch each hash key in order to perform tacit total+value summation;
do until (end);
set demo_data end=end;
if ids.find() ne 0 then ids.add();
end;
ids.output(dataset:'sum_value_over_id'); * save the summation of each key combination;
stop;
run;
Note: There can be only one keysum variable.
If the suminc variable was set to be always 1 instead of value, then the keysum would be the count instead of the total.
Obtaining both sum and count over group via hash would require an explicit defineData for a count and sum variable and slightly different statements, such as:
declare hash ids (ordered:'a');
...
ids.defineData('id', 'count', 'total');
...
if ids.find() ne 0 then do; count=0; total=0; end;
count+1;
total+value;
ids.replace();
...
However, if value is known to be always a natural number, and group size is known to be < 10group size limit you could numerically encode the count by using a suminc of value + 10-group size limit and numerically decode count by processing the output data with count = (total - int(total)) * 10group size limit.
For sorted data the fastest way is most likely a DOW loop with accumulation.
proc sort data=foo;
by id;
data sum_value_over_id_v2(keep=id total);
do until (last.id);
set foo;
by id;
total = sum(total, value);
end;
run;
You will likely find that I/O is largest component of performance.
The best answer varies dramatically by the application. In your example, PROC SQL at least on my machine significantly outperforms PROC MEANS, but there are plenty of cases where it will not do so. It's able to in this case because it's building hash tables behind the scenes, more than likely, which are quite fast - a single pass through the data is all that's needed.
You certainly could speed things up by putting your full dataset into memory with SASFILE, if you have room to store the whole thing. You would have to have it in memory to begin with, though, more than likely; just reading it into memory for this purpose alone wouldn't really help since you're doing that read anyway.
As Richard notes, there are a bunch of ways to do this. I think PROC SQL will often be the fastest or similar to the fastest in simple cases, both because it's multithreaded (as opposed to data step being single threaded) and because it's got a fast hash table backend.
PROC MEANS is also usually going to be competitive, the case you show in the example is almost a worst case for it since it's got a huge number of class variables so I think it may be creating a temporary table on disk. It's also multithreaded. Reduce the class variable categories to 2500 instead of 2,500,000 and you get PROC MEANS a bit faster than PROC SQL (but within the margin of error).
Data step accumulation, either in a hash table or a DoW loop, will sometimes outperform both of the above, and sometimes not, again depending on the data. Here it does outperform slightly. The code for data step accumulation tends to be a bit more complex, which is why I'd usually discourage it unless the savings is substantial (having more code to maintain is worse, typically). PROC MEANS and PROC SQL require less maintenance and less to understand. But in applications where performance is critical and these solutions happen to be superior, it may be worth it to go this route, especially if the data step is helpful. Of course, the hash table method is limited to fitting the results in memory, though usually that's manageable.
Ultimately, I would encourage you to use whatever method is easiest to maintain but still gives sufficient performance; and when possible try to be self consistent with other code. If most of your code is in SQL, that is probably fine. SASFILE and indexes probably won't be needed, unless you're doing more complicated things than you present above. Summation is actually more work than I/O in many cases. Don't overcomplicate it, ultimately: programmer hours and difficulty of QA is something that should trump basic performance, unless you're talking several hours' difference. And if you are, then just run tests on your actual use case and see what works best.
If you assume the data is sorted then this is another solution
data sum_value_over_id_v2(keep=id total);
set a.randint(keep=id val);
by id;
total + val;
if last.id then do;
output;
total = 0;
end;
drop val;
run;

SPSS Count function advice

Im having some issues with some SPSS code. Im new to SPSS and still trying to figure out the syntax. I'm trying to get my code to count the sum of two dice equal to 7. I cant get the count function to work the way I want it. Below is my code. Any tips would be greatly appreciated.
INPUT PROGRAM.
LOOP #I=1 TO 100000.
COMPUTE case = 1.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.
COMPUTE Dice_1 = TRUNC (RV.UNIFORM(1,7)).
COMPUTE Dice_2 = TRUNC (RV.UNIFORM(1,7)).
COMPUTE total = Dice_1+Dice_2.
COMPUTE Number_Sum7= Dice_1+Dice_2 = 7.
COUNT Num= case TO Number_Sum7(1).
SAVE outfile = 'my file path'.
count function counts across a list of variables, in each line separately.
What you seem to be looking for is to count over rows.
You can start with:
frequencies total. /* see counts of all possible totals.
means Number_Sum7/cells=sum. /* count only the cases where total=7.
These will give you the answers in the output window.
If you want the answers in data for further analysis, look up the aggregate function.
For example, the following will give you the same results but in a new datasets:
DATASET NAME ORIG.
DATASET DECLARE freqs.
AGGREGATE /OUTFILE='freqs' /BREAK=total /Mycount=N.
DATASET ACTIVATE ORIG.
DATASET DECLARE only7.
AGGREGATE /OUTFILE='only7' /BREAK= /only7=sum(Number_Sum7).
Or, instead, you can add the results to your present data:
AGGREGATE /OUTFILE=* MODE=ADDVARIABLES /BREAK=total /TotalCount=N.
AGGREGATE /OUTFILE=* MODE=ADDVARIABLES /BREAK= /total7=sum(Number_Sum7).

Discrepancy between R and Matlab speed [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Why are loops slow in R?
Consider the following task. A dataset has 40 variables for 20,000 "users". Each user has between 1 and 150 observations. All users are stacked in a matrix called data. The first column is the id of the user and identifies the user. All id are stored in a 20,000 X 1 matrix called userid.
Consider the following R code
useridl = length(userid)
itime=proc.time()[3]
for (i in 1:useridl) {
temp =data[data[,1]==userid[i],]
}
etime=proc.time()[3]
etime-itime
This code just goes through the 20,000 users, creating the temp matrix every time. With the subset of observations belonging to userid[i]. It takes about 6 minutes in a MacPro.
In MatLab, the same task
tic
for i=1:useridl
temp=data(data(:,1)==userid(i),:);
end
toc
takes 1 minute.
Why is R so much slower? This is standard task, I am using matrices in both cases. Any ideas?
As #joran commented, that's bad R practice. Instead of repeatedly subsetting your original matrix, just put the subsets in a list once and then iterate over the list with lapply or similar.
# make example data
set.seed(21)
userid <- 1:1e4
obs <- sample(150, length(userid), TRUE)
users <- rep(userid, obs)
Data <- cbind(users,matrix(rnorm(40*sum(obs)),sum(obs),40))
# reorder so Data isn't sorted by userid
Data <- Data[order(Data[,2]),]
# note that you have to call the data.frame method explicitly,
# the default method returns a vector
system.time(temp <- split.data.frame(Data, Data[,1])) ## Returns times in seconds
# user system elapsed
# 2.84 0.08 2.92
My guess is that the garbage collector is slowing down your R code, since you're continually overwriting the temp object.

Creating a SPSS loop function

I need to create a syntax loop that runs a series of transformation
This is a simplified example of what I need to do
I would like to create five fruit variables
apple_variable
banana_variable
mango_variable
papaya_variable
orange_variable
in V1
apple=1
banana=2
mango=3
papaya=4
orange=5
First loop
IF (V1={number}) {fruit}_variable = VX.
IF (V2={number}) {fruit}_variable = VY.
IF (V3={number}) {fruit}_variable = VZ.
Run loop for next fruit
So what I would like is the scripte to check if V1, V2 or V3 contains the fruit number. If one of them does (only one can) The new {fruit}_variable should get the value from VX, VY or VZ.
Is this possible? The script need to create over 200 variables so a bit to time consuming to do manually
The first loop can be put within a DO REPEAT command. Essentially you define your two lists of variables and you can loop over the set of if statements.
DO REPEAT V# = V1 V2 V3
/VA = VX VY VZ.
if V# = 1 apple_variable = VA.
END REPEAT.
Now 1 and apple_variable are hard coded in the example above, but we can roll this up into a simple macro statement to take arbitrary parameters.
DEFINE !fruit (!POSITIONAL = !TOKENS(1)
/!POSITIONAL = !TOKENS(1)).
DO REPEAT V# = V1 V2 V3
/VA = VX VY VZ.
if V# = !1 !2 = VA.
END REPEAT.
!ENDDEFINE.
!fruit 1 apple_variable.
Now this will still be a bit tedious for over 200 variables, but should greatly simplify the task. After I get this far I typically just do text editing to my list to call the macro 200 times, which in this instance all that it would take is inserting !fruit before the number and the resulting variable name. This works well especially if the list is static.
Other approaches using the in-built SPSS facilities (mainly looping within the defined MACRO) IMO tend to be ugly, can greatly complicate the code and frequently not worth the time (although certainly doable). Although that is somewhat mitigated if you are willing to accept a solution utilizing python commands.
DO REPEAT is a good solution here, but I'm wondering what the ultimate goal is. This smells like a problem that might be solved by using the multiple response facilities in Statistics without the need to go through these transformations. Multiple response functionality is available in the old MULTIPLE RESPONSE procedure and in the newer CTABLES and Chart Builder facilities.
HTH,
Jon Peck
combination of loop statements: for,while, do while with nested if..else and switch case will do the trick. just make sure you have your initial value and final value for the loop to go
let's say:
for (initial; final; increment)
{
if (x == value) {
statements;
}else{
...
}

In Stata, how do I manipulate matrix elements by their name?

In Stata, after a regression I know it is possible to call the elements of stored results by name. For example, if I want to manipulate the coefficient on the variable precip, I just type _b[precip]. My question is how do I do the same after the tabstat command? For example, say I want to multiply the coefficient on precip by the sample mean of precip:
reg --variables in regression--
tabstat --variables in regression--
mat X=r(StatTotal)
mat Y=_b[precip]*X[1,precip]
Ah, if only it were that simple. But alas, in the last line X[1, precip] is invalid syntax. Oddly, Stata does recognize display X[1, precip]. And Stata would know what I'm trying to do if instead of precip I used the column number where precip appears in the X vector. If I were just doing this operation once, no problem. But I need to do this operation several times (for several different model specifications) and for several variables which change position in the vector from one model to the next, so I cannot just use the column number.
I am not yet sure I understand exactly what you want to do, but here's my attempt to reproduce what you are doing:
sysuse auto, clear
regress price mpg foreign weight
tabstat mpg foreign weight, save
matrix X = r(StatTotal)
matrix Y = _b[mpg]*X[1, colnumb(X, "mpg") ]
If you need to put this into a cycle, that's doable, too:
matrix bb = e(b)
local explvar : colnames bb
foreach x in `explvar' {
if "`x'" != "_cons" {
matrix Y_`x' = _b[`x'] * X[1, colnumb(X, "`x'")]
}
else {
matrix Y_`x' = _b[`x']
}
}
You'd probably want to put this into a program that you will call after each regression model estimation call, e.g.:
program define reg2mat , prefix( name )
if "`e(cmd)'" != "regress" {
// this will intentionally produce an error
regress
}
tempname bb
matrix `bb' = e(b)
local explvar : colnames `bb'
foreach x in `explvar' {
if "`x'" != "_cons" {
matrix `prefix'_`x' = _b[`x'] * X[1, colnumb(X, "`x'")]
}
else {
matrix `prefix'_`x' = _b[`x']
}
}
end // of reg2mat
At many levels, it is not ideal, as it manipulates with the (global) matrices in Stata memory; most of the time, it is a bad idea, as the programs should only manipulate with objects local to them.
I suspect that what you want to do is addressed, in one way or another, by either omnipowerful margins command, or by an appropriate predict, or by matrix score (which is the low level version of predict). Attributing the effects to a variable only makes sense when your regressors are orthogonal, which only happens in carefully designed and conducted experiments.

Resources