constrained regression with many variables - matrix

I have around 200 dummies, and wish to run a constrained OLS regression where I impose that the sum of all coefficients on the dummies is equal to 1.
One option is to type:
constraint define 1 dummy_1+dummy_2 +...+dummy_200=1
cnsreg y x_1 x_2 dummy_1-dummy_200, c(1)
...but typing the constraint out would obviously be very painful.
Is there a way to quickly define such a large constraint? The matrix form would be very quick and straightforward, but after much reading online and in Stata guide, it is not clear to me how to do constraints in matrix form, and if they are even possible.

There are at least two sides to this, how to do it and whether it will work in any statistical sense.
How to do it seems easier than you fear as the difficult bit is just inserting "+" signs between the variable names, and that's string manipulation. Something like
unab myvars : dummy_*
local myvars : subinstr local myvars " " "+", all
mac li
constraint 1 `myvars' = 1
should get you started. The macro list is so you can see what you did, especially if it is not what you want.
Whether it will work for you statistically is outside the scope of this forum, but if that's the only constraint note that it's consistent with all kinds of negative and positive coefficients. Perhaps there are special features of your problem that make it a natural constraint, but my intuition is that such a model will be hard to estimate.

I would take a completely different approach. Such constraints typically occur when trying out a different coding scheme for a set of indicator variables. If that is the case then I would use Stata's factor variables, combined with margins with the contrast operators.

Related

Generate a Random number in Uppaal

My question is Can I generate a random number in Uppaal?
I would like to generate a number from a range of values. Even more, I would like to generate not just integers I would like to generate double values as well.
for example: double [7.25,18.3]
I found this question that were talking about the same. I tried it.
However, I got this error: syntax error unexpected T_SELECT.
It doesn't work. I'm pretty new in Uppaal world, I would appreciate any help that you can provide me.
Regards,
This is a common and misunderstood question in Uppaal.
Simple answer:
double val; // declaration
val = random(18.3-7.25)+7.25; // use in update, works in SMC (Uppaal v4.1)
Verbose answer:
Uppaal supports symbolic analysis as well as statistical and the treatment and possibilities are radically different. So one has to decide first what kind of analysis is needed. Usually one starts with simple symbolic analysis and then augment with stochastic features, sometimes stochastic behavior needs also to be checked symbolically.
In symbolic analysis (queries A[], A<>, E<>, E[] etc), random is synonymous with non-deterministic, i.e. if the model contains some "random" behavior, then verification should check all of them any way. Therefore such behavior is modelled as non-deterministic choices between edges. It is easy to setup a set of edges over an integer range by using select statement on the edge where a temporary variable is declared and its value can be used in guards, synchronization and update. Symbolic analysis supports only integer data types (no floating point types like double) and continuous ranges over clocks (specified by constraints in guards and invariants).
Statistical analysis (via Monte-Carlo simulations, queries like Pr[...](<> p), E[...](max: var), simulate, etc) supports double types and floating point functions like sin, cos, sqrt, random(MAX) (uniform distribution over [0, MAX)), random_normal(mean, dev) etc. in addition to int data types. Clock variables can also be treated as floating point type, except that their derivative is set to 1 by default (can be changed in the invariants which allow ODEs -- ordinary differential equations).
It is possible to create models with floating point operations (including random) and still apply symbolic analysis provided that the floating point variables do not influence/constrain the model behavior, and act merely as a cost function over the state space. Here are systematic rules to achieve this:
a) the clocks used in ODEs must be declared of hybrid clock type.
b) hybrid clock and double type variables cannot appear in guard and invariant constraints. Only ODEs are allowed over the hybrid clocks in the invariant.

How I predict how some formula will behave with integers?

I am making some software that need to work with integers.
Also I need to apply some formula to those integers, repeatedly over time (example, do x/=z several times in a row for a indefinite amount).
All tools, algorithms and formulas I could think or find, or don't work with integers at all, or work as approximations at best.
For example the x/=z several times in a row for example, you can theoretically calculate what x will be in the 10th time by doing x = x/(z^10), but that will be wrong if the result is fractional, you can use floor(x/(z^10)), but the result will STILL be wrong.
Plotting software that I found also don't have integers at all, or has "floor()/ceil()" functions support, at best, and still the result would fall in the problem of the previous paragraph.
So how I do it?
Here's something to get you going for the iteration of x/=z:
(that should have ended in "all three terms are 0 with regard to integer division")
Now if x or z are negative, you can try and see whether this still holds; I did not invest the time to make the necessary case distinctions, but they should be fairly analogous.
As Karoly Horvath mentions in a comment, without a clear specification of the kinds of functions for which you would like to find a shortcut to replace iterative evaluation, helping you out won't be possible since there are uncountably many functions over the integers, and the same approach won't work for all of them.

Optimization Toolbox (fmincon) - How to set logical constraints?

Hello all :) I'm pretty new to Optimization and barely understand it (was about ready to slit my wrist after figuring out how to write Objective Functions without any formal learning on the matter), and need a little help on a work project.
How would I go about setting a logical constraint when using the Optimization Toolbox, fmincon specifically (using Trust Region Reflective algorithm)?
I am optimizing 5 values (lets call it matrix OptMat), and I want to optimize with the constraint such that
max(OptMat)/min(OptMat) > 10
I assume this will optimize the 5 values of OptMat as low as possible, while keeping the above constraint in mind so that if a set of values for OptMat is found with a lower OF in which it breaks the constraint it will NOT report those values and instead report the next lowest OF where OptMat values meet the above constraint?
For the record, my lower bounds are [0,0,0,0,0]. I'm not sure how to enter it into upper bounds as it only accepts doubles and that would be logical. I tried the Active Set Algorithm and that enabled the Nonlinear Constraint Function box and I think I'm on the right track with that. If so, I'm not sure what the syntax for entering my desired constraint. Another method^that ^may ^or ^may ^not ^work I could think of is using this as an Upper Boundary.
[min(OptMat)*10, min(OptMat)*10, min(OptMat)*10, min(OptMat)*10, min(OptMat)*10]
Again, I'm using the GUI Optimization Toolbox. I haven't looked too much into command line optimization (though I will need to write it command line eventually) and I think I read somewhere that you can set the Upper Boundary and it does not have to be double?
Thank you so very much for the help, if someone is able. I apologize if this is a really nooby question.
What you are looking for are nonlinear constraints, fmincon can handle it (I only know the command, not the GUI) with the argument nonlcon. For more information look at this guide http://de.mathworks.com/help/optim/ug/fmincon.html
How would you implement this? First create a function
function [c, ceq] = mycondition(x)
c = -max(x)/min(x)/10;
ceq = 0;
I had to change the equation to match the correct formalism, i.e. c(x)<=0 is needed.
Maybe you could also create an anonymous function, I'm not sure (http://de.mathworks.com/help/matlab/matlab_prog/anonymous-functions.html).
Then use this function to feed the fmincon function using the # sign, i.e. at the specific location write
fmincon(...., #mycondition, ...)

Why does accessing coefficients following estimation with nl require slightly different syntax than for other estimation commands?

Following most estimation commands in Stata (e.g. reg, logit, probit, etc.) one may access the estimates using the _b[ParameterName] syntax (or the synonymous _coef[ParameterName]). For example:
regress y x
followed by
di _b[x]
will display the estimate of the coefficient of x. di _b[_cons] will display the coefficient of the estimated intercept (assuming the regress command was successful), etc.
But if I use the nonlinear least squares command nl I (seemingly) have to do something slightly different. Now (leaving aside that for this example model there is absolutely no need to use a NLLS regression):
nl (y = {_cons} + {x}*x)
followed by (notice the forward slash)
di _b[/x]
will display the estimate of the coefficient of x.
Why does accessing parameter estimates following nl require a different syntax? Are there subtleties to be aware of?
"leaving aside that for this example model there is absolutely no need to use a NLLS regression": I think that's what you can't do here....
The question is about why the syntax is as it is. That's a matter of logic and a matter of history. Why a particular syntax was chosen is ultimately a question for the programmers at StataCorp who chose it. Here is one limited take on your question.
The main syntax for regression-type models grows out of a syntax designed for linear regression models in which by default the parameters include an intercept, as you know.
The original syntax for nonlinear regression models (in the sense of being estimated by nonlinear least-squares) matches a need to estimate a bundle of parameters specified by the user, which need not include an intercept at all.
Otherwise put, there is no question of an intercept being a natural default; no parameterisation is a natural default and each model estimated by nl is sui generis.
A helpful feature is that users can choose the names they find natural for the parameters, within the constraints of what counts as a legal name in Stata, say alpha, beta, gamma, a, b, c, etc. If you choose _cons for the intercept in nl that is a legal name but otherwise not special and just your choice; nl won't take it as a signal that it should flip into using regress conventions.
The syntax you cite is part of what was made possible by a major redesign of nl but it is consistent with the original philosophy.
That the syntax is different because it needs to be may not be the answer you seek, but I guess you'll get a fuller answer only from StataCorp; developers do hang out on Statalist, but they don't make themselves visible here.

Maximum difference between columns using relational algebra

Is it possible to obtain the maximum difference between two columns (for example starting and ending weights)?
Right now I'm leaning towards no as this would require a new column with the difference between the two columns for each row, then taking the max of that. Doing it the way I orginally intended doesn't work either since arithmetic operations are not allowed in the conditions of select operations (e.g. SIGMA (c1 - c2 < c3 - c4)(Table) is not allowed).
Disclosure: this is part of a homework question.
It can be done, exactly in the way you planned, but you need generalized projection for that. The generalized projection is the operator
Π(E1, E2,..., En)R
where R is a relation, and E1...En are expressions in the form a⊕b, where a and b are attributes of R or constants, and ⊕ is an arbitrary binary operator between them. The result is a relation with attributes E1...En.
This would allow you to project the differences into a new relation (R' := Π(x-y)R), then find the maximum on that, just as you planned.
If we're not allowed to use generalized projection, then I think we have no means to actually subtract an attribute from another, or to actually calculate anything from them, as the definition of projection allow only attribute names, and the definition of selection allow only expressions of the form aθb where a and b are attributes or constants and θ is a binary relational operator (this is logical, in its way, because if we have a relation R(X,Y), then we have no idea about the type of X or Y, making operations on them quite meaningless).
I think generalized projection is a great extension to relational algebra. It's obviously immensely useful in real life, and it can be defended even from a more scientific point of view: if we allow binary conditional operators on the values like "X > 50", then we made assumptions on the type already, rendering that point kind of moot. Your instructor may disagree, though.
If you're looking to do this in the real world, you should be able to do this with a subquery (or a view, which amounts to much the same thing), something like:
select max (diff) from (
select high - low as diff from blah blah blah
)
Whether this applies to the abstract world of relational algebra, I couldn't say. I'm too busy fixing those damn real-world problems :-)

Resources