In Stata, how do I manipulate matrix elements by their name? - matrix

In Stata, after a regression I know it is possible to call the elements of stored results by name. For example, if I want to manipulate the coefficient on the variable precip, I just type _b[precip]. My question is how do I do the same after the tabstat command? For example, say I want to multiply the coefficient on precip by the sample mean of precip:
reg --variables in regression--
tabstat --variables in regression--
mat X=r(StatTotal)
mat Y=_b[precip]*X[1,precip]
Ah, if only it were that simple. But alas, in the last line X[1, precip] is invalid syntax. Oddly, Stata does recognize display X[1, precip]. And Stata would know what I'm trying to do if instead of precip I used the column number where precip appears in the X vector. If I were just doing this operation once, no problem. But I need to do this operation several times (for several different model specifications) and for several variables which change position in the vector from one model to the next, so I cannot just use the column number.

I am not yet sure I understand exactly what you want to do, but here's my attempt to reproduce what you are doing:
sysuse auto, clear
regress price mpg foreign weight
tabstat mpg foreign weight, save
matrix X = r(StatTotal)
matrix Y = _b[mpg]*X[1, colnumb(X, "mpg") ]
If you need to put this into a loop, that's doable, too:
matrix bb = e(b)
local explvar : colnames bb
foreach x in `explvar' {
if "`x'" != "_cons" {
matrix Y_`x' = _b[`x'] * X[1, colnumb(X, "`x'")]
}
else {
matrix Y_`x' = _b[`x']
}
}
You'd probably want to put this into a program that you will call after each regression model estimation call, e.g.:
program define reg2mat
syntax , prefix(name)
if "`e(cmd)'" != "regress" {
// this will intentionally produce an error
regress
}
tempname bb
matrix `bb' = e(b)
local explvar : colnames `bb'
foreach x in `explvar' {
if "`x'" != "_cons" {
matrix `prefix'_`x' = _b[`x'] * X[1, colnumb(X, "`x'")]
}
else {
matrix `prefix'_`x' = _b[`x']
}
}
end // of reg2mat
On many levels this is not ideal, as it manipulates (global) matrices in Stata memory; most of the time that is a bad idea, as programs should only manipulate objects local to them.
I suspect that what you want to do is addressed, in one way or another, by the all-powerful margins command, by an appropriate predict, or by matrix score (which is the low-level version of predict). Attributing effects to a single variable only makes sense when your regressors are orthogonal, which only happens in carefully designed and conducted experiments.

Related

Generate “hash” functions programmatically

I have some extremely old legacy procedural code which takes 10 or so enumerated inputs [ i0, i1, i2, ... i9 ] and generates 170 odd enumerated outputs [ r0, r1, ... r168, r169 ]. By enumerated, I mean that each individual input & output has its own set of distinct value sets e.g. [ red, green, yellow ] or [ yes, no ] etc.
I’m putting together the entire state table using the existing code, and instead of puzzling through them by hand, I was wondering if there was an algorithmic way of determining an appropriate function to get to each result from the 10 inputs. Note, not all input columns may be required to determine an individual output column, i.e. r124 might only be dependent on i5, i6 and i9.
These are not continuous functions, and I expect I might end up with some sort of hashing function approach, but I wondered if anyone knew of a more repeatable process I should be using instead? (If only there was some Karnaugh map like approach for multiple value non-binary functions ;-) )
If you are willing to actually enumerate all possible input/output sequences, here is a theoretical approach to tackle this that should be fairly effective.
First, consider the entropy of the output. Suppose that you have n possible input sequences, and x[i] is the number of ways to get i as an output. Let p[i] = float(x[i])/float(n) and then the entropy is - sum(p[i] * log(p[i]) for i in outputs). (Note, since p[i] <= 1 the log(p[i]) is never positive, and therefore the entropy is never negative. Also note, if p[i] = 0 then we assume that p[i] * log(p[i]) is also zero.)
The amount of entropy can be thought of as the amount of information needed to predict the outcome.
Now here is the key question. What variable gives us the most information about the output per information about the input?
If a particular variable v has in[v] possible values, the amount of information in specifying v is log(float(in[v])). I already described how to calculate the entropy of the entire set of outputs. For each possible value of v we can calculate the entropy of the entire set of outputs for that value of v. The amount of information given by knowing v is the entropy of the total set minus the average of the entropies for the individual values of v.
Pick the variable v which gives you the best ratio of information_gained_from_v/information_to_specify_v. Your algorithm will start with a switch on the set of values of that variable.
Then for each value, you repeat this process to get cascading nested if conditions.
This will generally lead to a fairly compact set of cascading nested if conditions that will focus on the input variables that tell you as much as possible, as quickly as possible, with as few branches as you can manage.
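As a rough illustration of the procedure above, here is a minimal Python sketch (Python is my choice here, not something the question specifies). The function legacy, the three input domains, and all helper names are hypothetical stand-ins for the real code and its state table. The sketch uses a frequency-weighted average of the per-value entropies, which equals the plain average when every value of a variable occurs equally often, as it does in a full enumeration.

```python
import math
from collections import Counter, defaultdict
from itertools import product


def entropy(outputs):
    """Shannon entropy (natural log) of a list of output values."""
    n = len(outputs)
    return -sum((c / n) * math.log(c / n) for c in Counter(outputs).values())


def best_split(rows, candidates):
    """Pick the input index with the best ratio of information gained
    about the output to information needed to specify that input."""
    total = entropy([out for _, out in rows])
    best, best_ratio = None, -1.0
    for v in candidates:
        groups = defaultdict(list)
        for inputs, out in rows:
            groups[inputs[v]].append(out)
        if len(groups) < 2:
            continue  # only one value present: nothing to switch on
        remaining = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
        ratio = (total - remaining) / math.log(len(groups))
        if ratio > best_ratio:
            best, best_ratio = v, ratio
    return best


def build_tree(rows, candidates):
    """Return a constant output, or (input_index, {value: subtree}).
    Assumes output values are never tuples themselves."""
    outputs = {out for _, out in rows}
    if len(outputs) == 1:
        return outputs.pop()
    v = best_split(rows, candidates)
    if v is None:  # nothing left to split on: fall back to the most common output
        return Counter(out for _, out in rows).most_common(1)[0][0]
    rest = [c for c in candidates if c != v]
    groups = defaultdict(list)
    for row in rows:
        groups[row[0][v]].append(row)
    return (v, {val: build_tree(g, rest) for val, g in groups.items()})


def predict(tree, inputs):
    """Walk the nested switch until a constant output is reached."""
    while isinstance(tree, tuple):
        v, branches = tree
        tree = branches[inputs[v]]
    return tree


# Hypothetical stand-in for the legacy code: the third input is irrelevant.
def legacy(i0, i1, i2):
    if i0 == "red":
        return "stop"
    return "go" if i1 == "yes" else "wait"


domains = [("red", "green", "yellow"), ("yes", "no"), (0, 1, 2)]
rows = [(inputs, legacy(*inputs)) for inputs in product(*domains)]
tree = build_tree(rows, list(range(len(domains))))
print(tree)
```

The resulting tuple/dict structure maps directly onto nested switch statements; inputs that carry no information about an output simply never appear in its tree.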
Now this assumed that you had a comprehensive enumeration. But what if you don't?
The answer to that is that the analysis that I described can be done for a random sample of your possible set of inputs. So if you run your code with, say, 10,000 random inputs, then you'll come up with fairly good entropies for your first level. Repeat with 10,000 random inputs for each of your branches on your second level, and the same will happen. Continue as long as it is computationally feasible.
If there are good patterns to find, you will quickly find a lot of patterns of the form, "If you put in this that and the other, here is the output you always get." If there is a reasonably short set of nested ifs that give the right output, you're probably going to find it. After that, you have the question of deciding whether to actually verify by hand that each bucket is reliable, or to trust that if you couldn't find any exceptions with 10,000 random inputs, then there are none to be found.
Tricky approach for the validation. If you can find fuzzing software written for your language, run the fuzzing software with the goal of trying to tease out every possible internal execution path for each bucket you find. If the fuzzing software decides that you can't get different answers than the one you think is best from the above approach, then you can probably trust it.
The algorithm is pretty straightforward. Given the possible values for each input, we can generate all possible input vectors. Then, for each output, we can eliminate the inputs that do not matter for that output. As a result, for each output we get a matrix showing the output values for all input combinations, excluding the inputs that do not matter for that output.
Sample input format (for the code snippet below):
var schema = new ConvertionSchema()
{
InputPossibleValues = new object[][]
{
new object[] { 1, 2, 3, }, // input #0
new object[] { 'a', 'b', 'c' }, // input #1
new object[] { "foo", "bar" }, // input #2
},
Converters = new System.Func<object[], object>[]
{
input => input[0], // output #0
input => (int)input[0] + (int)(char)input[1], // output #1
input => (string)input[2] == "foo" ? 1 : 42, // output #2
input => input[2].ToString() + input[1].ToString(), // output #3
input => (int)input[0] % 2, // output #4
}
};
Sample output:
The heart of the backward conversion is below. The full code, in the form of a LINQPad snippet, is here: http://share.linqpad.net/cknrte.linq.
public void Reverse(ConvertionSchema schema)
{
// generate all possible input vectors and record the result for each case
// then for each output we can figure out which inputs matter
object[][] inputs = schema.GenerateInputVectors();
// reversal path
for (int outputIdx = 0; outputIdx < schema.OutputsCount; outputIdx++)
{
List<int> inputsThatDoNotMatter = new List<int>();
for (int inputIdx = 0; inputIdx < schema.InputsCount; inputIdx++)
{
// find all groups for input vectors where all other inputs (excluding current) are the same
// if across these groups outputs are exactly the same, then it means that current input
// does not matter for given output
bool inputMatters = inputs.GroupBy(input => ExcudeByIndexes(input, new[] { inputIdx }), input => schema.Convert(input)[outputIdx], ObjectsByValuesComparer.Instance)
.Where(x => x.Distinct().Count() > 1)
.Any();
if (!inputMatters)
{
inputsThatDoNotMatter.Add(inputIdx);
Util.Metatext($"Input #{inputIdx} does not matter for output #{outputIdx}").Dump();
}
}
// mapping table (only inputs that matter)
var mapping = new List<dynamic>();
foreach (var inputGroup in inputs.GroupBy(input => ExcudeByIndexes(input, inputsThatDoNotMatter), ObjectsByValuesComparer.Instance))
{
dynamic record = new ExpandoObject();
object[] sampleInput = inputGroup.First();
object output = schema.Convert(sampleInput)[outputIdx];
for (int inputIdx = 0; inputIdx < schema.InputsCount; inputIdx++)
{
if (inputsThatDoNotMatter.Contains(inputIdx))
continue;
AddProperty(record, $"Input #{inputIdx}", sampleInput[inputIdx]);
}
AddProperty(record, $"Output #{outputIdx}", output);
mapping.Add(record);
}
// input x, ..., input y, output z form is needed
mapping.Dump();
}
}
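For readers without LINQPad, the same elimination idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: inputs_that_matter and mapping_table are hypothetical names, each output is passed as a plain function of the input vector, and output values must be hashable.

```python
from itertools import product


def inputs_that_matter(possible_values, func):
    """Return the indexes of the inputs whose value can change func's result."""
    vectors = list(product(*possible_values))
    matter = []
    for idx in range(len(possible_values)):
        seen = {}  # values of all other inputs -> set of outputs observed
        for vec in vectors:
            key = vec[:idx] + vec[idx + 1:]
            seen.setdefault(key, set()).add(func(vec))
        # The input matters if changing it alone ever changes the output.
        if any(len(outs) > 1 for outs in seen.values()):
            matter.append(idx)
    return matter


def mapping_table(possible_values, func):
    """Return (relevant input indexes, {relevant input values: output})."""
    keep = inputs_that_matter(possible_values, func)
    table = {}
    for vec in product(*possible_values):
        table[tuple(vec[i] for i in keep)] = func(vec)
    return keep, table


# Same input domains as the sample schema above.
possible_values = [(1, 2, 3), ("a", "b", "c"), ("foo", "bar")]

# Counterparts of output #2 and output #4 in the sample schema.
keep2, table2 = mapping_table(possible_values, lambda v: 1 if v[2] == "foo" else 42)
keep4, table4 = mapping_table(possible_values, lambda v: v[0] % 2)
print(keep2, table2)
print(keep4, table4)
```

Each table lists only the inputs that matter for that output, which is the "input x, ..., input y, output z" form the C# code produces.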

Assignment problems with simple random number generation in Modelica

I am relatively new to Modelica (Dymola-environment) and I am getting very desperate/upset that I cannot solve such a simple problem as a random number generation in Modelica and I hope that you can help me out.
The simple function random produces a random number between 0 and 1 with an input seed seedIn[3] and produces the output seed seedOut[3] for the next time step or event. The call
(z,seedOut) = random(seedIn);
works perfectly fine.
The problem is that I cannot find a way in Modelica to compute this assignment over time by using the seedOut[3] as the next seedIn[3], which is very frustrating.
My simple program looks like this:
model Randomgenerator
Real z;
Integer seedIn[3]( start={1,23,131},fixed=true), seedOut[3];
equation
(z,seedOut) = random(seedIn);
algorithm
seedIn := seedOut;
end Randomgenerator;
I have tried nearly all possibilities with algorithm assignments, initial conditions and equations but none of them works. I just simply want to use seedOut in the next time step. One problem seems to be that when entering into the algorithm section, neither the initial conditions nor the values from the equation section are used.
Using the 'sample' and 'reinit' functions the code below will calculate a new random number at the frequency specified in 'sample'. Note the way of defining the "start value" of seedIn.
model Randomgenerator
Real seedIn[3] = {1,23,131};
Real z;
Real[3] seedOut;
equation
(z,seedOut) = random(seedIn);
when sample(1,1) then
reinit(seedIn,pre(seedOut));
end when;
end Randomgenerator;
The 'pre' function allows the use of the previous value of the variable. If it were not used, the output 'z' would return a constant value. Two things regarding the 'reinit' function: it must be used inside a 'when' clause, and it requires 'Real' variables/expressions, hence seedIn and seedOut are now defined as 'Real'.
The simple "random" generator I used was:
function random
input Real[3] seedIn;
output Real z;
output Real[3] seedOut;
algorithm
seedOut[1] :=seedIn[1] + 1;
seedOut[2] :=seedIn[2] + 5;
seedOut[3] :=seedIn[3] + 10;
z :=(0.1*seedIn[1] + 0.2*seedIn[2] + 0.3*seedIn[3])/(0.5*sum(seedIn));
end random;
Surely there are other ways depending on the application to perform this operation. At least this will give you something to start with. Hope it helps.
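To make the intended seed hand-off explicit outside Modelica, here is a small Python sketch of the same idea: at every sample event the previous seedOut becomes the next seedIn. It ports the toy generator above (which is not a real random number generator); simulate is a hypothetical helper name, and the sketch ignores Modelica's event and initialization semantics.

```python
def random(seed_in):
    """Port of the toy generator above: returns (z, seed_out)."""
    seed_out = [seed_in[0] + 1, seed_in[1] + 5, seed_in[2] + 10]
    z = (0.1 * seed_in[0] + 0.2 * seed_in[1] + 0.3 * seed_in[2]) / (0.5 * sum(seed_in))
    return z, seed_out


def simulate(seed, steps):
    """Call random() once per sample event, feeding each seed_out back in."""
    zs = []
    for _ in range(steps):
        z, seed_out = random(seed)
        zs.append(z)
        seed = seed_out  # the next event starts from the previous seed_out
    return zs, seed


zs, last_seed = simulate([1, 23, 131], 3)
print(zs, last_seed)
```

If seed were never updated inside the loop, every call would return the same z, which is the constant-output symptom described above.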

Creating matrix of "concord" results

I have a matrix with 400 rows and 40 columns.
I would like to create a new matrix from this data where I calculate the concordance between 2 variables, i.e., concord [A1,B1]=number1; concord [A1,B2]=number2; [A1,B39]=number39. So, number1 should now be the first number of the first row of a new matrix; number 2 is the second number in the first row....
The end result is a new matrix that shows the rho_c for each pair of numbers in the original data matrix.
The original matrix has a lot of empty cells. I can also create multiple matrices for subsections of the concordance calculations; it doesn't matter much. However, I don't quite understand how to write this command in Mata.
I've searched here: http://jasoneichorst.com/wp-content/uploads/2012/01/BeginMatrix.pdf
EDIT: The data looks like this (variable "Score1" is a rater). Not all raters rate the same item.
Assuming I fully understand the question, there are methods to do this. One which comes to mind involves the use of concord available from SSC (ssc install concord) along with some local macros and loops.
/* Clear and set up sample data */
clear *
set obs 60
forvalues i = 1/6 {
gen A`i' = runiform()
}
replace A2 = . in 10/L
replace A3 = . in 1/5
replace A3 = . in 20/L
replace A4 = . in 1/20
replace A4 = . in 30/L
replace A5 = . in 1/15
replace A5 = . in 40/L
replace A6 = . in 1/40
/* End data set-up */
* describe, varlist will allow you to store your variables in a local macro
qui describe, varlist
local vars `r(varlist)'
* get number of variables in local macro vars
local varcount : word count `vars'
* Create a matrix to hold rho_c
mat rho = J(`varcount',`varcount',.)
mat rownames rho = `vars'
mat colnames rho = `vars'
* Loop through vars to run concord on all unique combinations of A1-A6
* using the position of each variable in local vars to assign the var name
* to local x and local y
* concord is executed only for j >= i so that you don't end up with the same
* pair of variables being run twice (e.g., A1,A2 and A2,A1)
forvalues i = 1/`varcount' {
local y `: word `i' of `vars''
forvalues j = 1/`varcount' {
local x `: word `j' of `vars''
if `j' >= `i' {
capture noisily concord `y' `x'
* store rho_c only if concord ran without error
if _rc == 0 mat rho[`i',`j'] = r(rho_c)
}
}
}
* Display the results stored in the matrix, rho.
mat list rho
The above code should get you started, but there may need to be changes made depending on exactly what you want to do.
You will notice that inside of the loop, I have included capture noisily before concord. The reason for this is because in the image you linked to, your variables were missing values across entire sections of observations. This will likely result in an error message being thrown (specifically, r(2000): no observations). The capture piece forces Stata to continue to execute the loop if an error occurs there. The noisily piece tells Stata to display the output from concord even though capture was specified.
Also, if you search help concord in Stata, you will be directed to the help page which indicates that the concordance correlation coefficient is stored in r(rho_c). You can store these as individual scalars inside the loop or do as in the example and create a kxk matrix of values.
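For reference, the quantity being collected can also be computed directly. The Python sketch below builds the same upper-triangular matrix using Lin's concordance correlation coefficient, rho_c = 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2), on pairwise-complete observations. It is an illustration under assumptions: moments use 1/n, which may differ from the normalization the concord command applies, missing values are represented as None, and the function names are hypothetical.

```python
def concordance(x, y):
    """Lin's concordance correlation coefficient on pairwise-complete data.
    Returns None with fewer than two complete pairs or a zero denominator."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxx = sum((a - mx) ** 2 for a, _ in pairs) / n
    syy = sum((b - my) ** 2 for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs) / n
    denom = sxx + syy + (mx - my) ** 2
    return None if denom == 0 else 2 * sxy / denom


def concordance_matrix(columns):
    """columns: dict of name -> list of values. Fills only cells with j >= i,
    mirroring the loop above; other cells stay None."""
    names = list(columns)
    k = len(names)
    rho = [[None] * k for _ in range(k)]
    for i in range(k):
        for j in range(i, k):
            rho[i][j] = concordance(columns[names[i]], columns[names[j]])
    return names, rho


# Made-up ratings; A4 overlaps with the others in a single observation only.
columns = {
    "A1": [1, 2, 3, 4],
    "A2": [1, 2, 3, 4],
    "A3": [2, 3, 4, 5],
    "A4": [None, None, None, 1],
}
names, rho = concordance_matrix(columns)
for name, row in zip(names, rho):
    print(name, row)
```

Pairs of raters with too little overlap end up as None, which plays the same role as the missing cells left behind when concord stops with r(2000).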

Stata: why is my matrix not clearing over the foreach loop

When I run the following code, the two output matrices (diffInDiffOne & diffInDiffTwo) are the same. My guess is that coeffs is not being replaced after each loop, but I have no idea why. I think that the coefficients matrix is being overwritten, but I have no idea how. I tried changing the for loop order, but this surprisingly didn't solve my issue either:
local treatments treat_one treat_two
matrix diffInDiffOne = J(1,9,.)
matrix diffInDiffTwo = J(1,9,.)
foreach treatment in `treatments' {
reg science inSchool#`treatment'#male
matrix coeffs=e(b)
if treat_one==`treatment'{
matrix diffInDiffOne = diffInDiffOne\coeffs
}
if treat_two==`treatment'{
matrix diffInDiffTwo = diffInDiffTwo\coeffs
}
}
matrix list diffInDiffOne
matrix list diffInDiffTwo
When I list the matrices they are both the same, despite the fact that the two regressions give different answers. Any help with this issue is much appreciated. Thanks
This code appears at first sight to reduce to
reg science inSchool#treat_one#male
matrix li e(b)
reg science inSchool#treat_two#male
matrix li e(b)
apart from the detail of adding nine missing values to the matrix.
However, that is not your code, so what is biting you? I guess at something much more subtle.
You need to be very careful with the if command. Variables evaluated in if commands are evaluated in their first observation. So, the first time round the loop the conditions are
if treat_one[1] == treat_one[1]
if treat_two[1] == treat_one[1]
The second time, it is
if treat_one[1] == treat_two[1]
if treat_two[1] == treat_two[1]
If it is true in your data that treat_one[1] == treat_two[1] the effect will not be as you may imagine.
If you want to test for equality of strings, do something like
if "`treatment'" == "treat_one"
You may have in mind something more like
foreach treatment in treat_one treat_two {
reg science inSchool#`treatment'#male
matrix `treatment' = e(b)
matrix list `treatment'
}
You seem to be wanting to write very complicated code for rather simple problems. A while back, I recommended thinking in terms of do-files rather than programs. That may be advice to reconsider.

Removing a "row" from a structure array

This is similar to a question I asked before, but is slightly different:
So I have a very large structure array in MATLAB. To simplify the situation, suppose I have something like:
structure(1).name, structure(2).name, structure(3).name and structure(1).returns, structure(2).returns, structure(3).returns (in my real program I have 647 structures)
Suppose further that structure(i).returns is a vector (very large vector, approximately 2,000,000 entries) and that a condition comes along where I want to delete the jth entry from structure(i).returns for all i. How do you do this? or rather, how do you do this reasonably fast? I have tried some things, but they are all insanely slow (I will show them in a second) so I was wondering if the community knew of faster ways to do this.
I have parsed my data two different ways; the first way had everything saved as cell arrays, but because things hadn't been working well for me I parsed the data again and placed everything as vectors.
What I'm actually doing is trying to delete NaN data, as well as all data in the same corresponding row of my data file, and then doing the very same thing after applying the Hampel filter. The relevant part of my code in this attempt is:
for i=numStock+1:-1:1
for j=length(stock(i).return):-1:1
if(isnan(stock(i).return(j)))
for k=numStock+1:-1:1
stock(k).return(j) = [];
end
end
end
stock(i).return = sort(stock(i).return);
stock(i).returnLength = length(stock(i).return);
stock(i).medianReturn = median(stock(i).return);
stock(i).madReturn = mad(stock(i).return,1);
end;
for i=numStock:-1:1
for j = length(stock(i+1).volume):-1:1
if(isnan(stock(i+1).volume(j)))
for k=numStock:-1:1
stock(k+1).volume(j) = [];
end
end
end
stock(i+1).volume = sort(stock(i+1).volume);
stock(i+1).volumeLength = length(stock(i+1).volume);
stock(i+1).medianVolume = median(stock(i+1).volume);
stock(i+1).madVolume = mad(stock(i+1).volume,1);
end;
for i=numStock+1:-1:1
for j=stock(i).returnLength:-1:1
if (abs(stock(i).return(j) - stock(i).medianReturn) > 3*stock(i).madReturn)
for k=numStock+1:-1:1
stock(k).return(j) = [];
end
end;
end;
end;
for i=numStock:-1:1
for j=stock(i+1).volumeLength:-1:1
if (abs(stock(i+1).volume(j) - stock(i+1).medianVolume) > 3*stock(i+1).madVolume)
for k=numStock:-1:1
stock(k+1).volume(j) = [];
end
end;
end;
end;
However, this returns an error:
"Matrix index is out of range for deletion.
Error in Failure (line 110)
stock(k).return(j) = [];"
So instead I tried by parsing everything in as vectors. Then I decided to try and delete the appropriate entries in the vectors prior to building the structure array. This isn't returning an error, but it is very slow:
%% Delete bad data, Hampel Filter
% Delete bad entries
id=strcmp(returns,'');
returns(id)=[];
volume(id)=[];
date(id)=[];
ticker(id)=[];
name(id)=[];
permno(id)=[];
sp500(id) = [];
id=strcmp(returns,'C');
returns(id)=[];
volume(id)=[];
date(id)=[];
ticker(id)=[];
name(id)=[];
permno(id)=[];
sp500(id) = [];
% Convert returns from string to double
returns=cellfun(@str2double,returns);
sp500=cellfun(@str2double,sp500);
% Delete all data for which a return is not a number
nanid=isnan(returns);
returns(nanid)=[];
volume(nanid)=[];
date(nanid)=[];
ticker(nanid)=[];
name(nanid)=[];
permno(nanid)=[];
% Delete all data for which a volume is not a number
nanid=isnan(volume);
returns(nanid)=[];
volume(nanid)=[];
date(nanid)=[];
ticker(nanid)=[];
name(nanid)=[];
permno(nanid)=[];
% Apply the Hampel filter, and delete all data corresponding to
% observations deleted by the filter.
medianReturn = median(returns);
madReturn = mad(returns,1);
for i=length(returns):-1:1
if (abs(returns(i) - medianReturn) > 3*madReturn)
returns(i) = [];
volume(i)=[];
date(i)=[];
ticker(i)=[];
name(i)=[];
permno(i)=[];
end;
end
medianVolume = median(volume);
madVolume = mad(volume,1);
for i=length(volume):-1:1
if (abs(volume(i) - medianVolume) > 3*madVolume)
returns(i) = [];
volume(i)=[];
date(i)=[];
ticker(i)=[];
name(i)=[];
permno(i)=[];
end;
end
As I said, this is very slow, probably because I'm using a for loop on a very large data set; however, I'm not sure how else one would do this. Sorry for the gigantic post, but does anyone have a suggestion as to how I might go about doing what I'm asking in a reasonable way?
EDIT: I should add that getting the vector method to work is probably preferable, since my aim is to put all of the return vectors into a matrix and get all of the volume vectors into a matrix and perform PCA on them, and I'm not sure how I would do that using cell arrays (or even if princomp would work on cell arrays).
EDIT2: I have altered the code to match your suggestion (although I did decide to give up speed and keep the for-loops so that I can keep the structure array, since reparsing this data will be way worse time-wise). The new code snippet is:
stock_return = zeros(numStock+1,length(stock(1).return));
for i=1:numStock+1
for j=1:length(stock(i).return)
stock_return(i,j) = stock(i).return(j);
end
end
stock_return = stock_return(~any(isnan(stock_return)), : );
This returns an Index exceeds matrix dimensions error, and I'm not sure why. Any suggestions?
I could not find a convenient way to handle structures, therefore I would restructure the code so that instead of structures it uses just arrays.
For example instead of stock(i).return(j) I would do stock_returns(i,j).
I show you on a part of your code how to get rid of for-loops.
Say we deal with this code:
for j=length(stock(i).return):-1:1
if(isnan(stock(i).return(j)))
for k=numStock+1:-1:1
stock(k).return(j) = [];
end
end
end
Now, the deletion of columns with any NaN data goes like this:
stock_return = stock_return(:, ~any(isnan(stock_return)) );
As for the absolute difference from the median, you can write similar code:
% stock_return_length is a scalar (the number of columns of stock_return)
% stock_median_return is a column vector (e.g. [1;2;3]), one entry per stock
% stock_mad_return is also a column vector, one entry per stock
median_return = repmat(stock_median_return, 1, stock_return_length);
mad_return = repmat(stock_mad_return, 1, stock_return_length);
is_bad = abs(stock_return - median_return) > 3 .* mad_return;
stock_return = stock_return(:, ~any(is_bad));
Using a scalar for stock_return_length means of course that the return lengths are the same, but you implicitly assume it in your original code anyway.
The important point in my answer is using any. Logical indexing is not sufficient in itself, since in your original code you delete all the values if any of them is bad.
Reference to any: http://www.mathworks.co.uk/help/matlab/ref/any.html.
If you want to preserve the original structure, so that you stick to stock(i).return, you can speed up your code using essentially the same scheme, but you can eliminate one fewer for-loop, meaning that your program will be substantially slower than the array version.
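The column-wise logic (drop an observation for every stock if it is bad for any stock) can be checked on a toy example. The sketch below restates it in plain Python rather than MATLAB; drop_bad_columns and the sample numbers are made up for illustration, and the median absolute deviation stands in for mad(x,1).

```python
import math
from statistics import median


def drop_bad_columns(rows, k=3.0):
    """rows: one equal-length list per stock. Drops every column that holds a
    NaN in any row, then every column in which any row's value lies more than
    k median absolute deviations from that row's median."""
    ncols = len(rows[0])
    # Step 1: keep only columns without NaN in any row.
    keep = [j for j in range(ncols) if not any(math.isnan(r[j]) for r in rows)]
    rows = [[r[j] for j in keep] for r in rows]
    # Step 2: per-row median and MAD, computed once on the NaN-free data.
    meds = [median(r) for r in rows]
    mads = [median([abs(v - m) for v in r]) for r, m in zip(rows, meds)]
    good = [
        j
        for j in range(len(keep))
        if not any(abs(r[j] - m) > k * d for r, m, d in zip(rows, meds, mads))
    ]
    return [[r[j] for j in good] for r in rows]


nan = float("nan")
stock_return = [
    [1.0, 2.0, nan, 3.0, 100.0, 2.0],  # stock 1: one NaN, one outlier
    [5.0, 5.1, 6.0, 5.2, 5.1, nan],    # stock 2: one NaN
]
cleaned = drop_bad_columns(stock_return)
print(cleaned)
```

As in the MATLAB version, a column is removed for all rows at once, so the rows stay aligned and can still be stacked into one matrix afterwards.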

Resources