Maximum difference between columns using relational algebra - relational-algebra

Is it possible to obtain the maximum difference between two columns (for example starting and ending weights)?
Right now I'm leaning towards no as this would require a new column with the difference between the two columns for each row, then taking the max of that. Doing it the way I orginally intended doesn't work either since arithmetic operations are not allowed in the conditions of select operations (e.g. SIGMA (c1 - c2 < c3 - c4)(Table) is not allowed).
Disclosure: this is part of a homework question.

It can be done, exactly in the way you planned, but you need generalized projection for that. The generalized projection is the operator
Π(E1, E2,..., En)R
where R is a relation, and E1...En are expressions in the form a⊕b, where a and b are attributes of R or constants, and ⊕ is an arbitrary binary operator between them. The result is a relation with attributes E1...En.
This would allow you to project the differences into a new relation (R' := Π(x-y)R), then find the maximum on that, just as you planned.
If we're not allowed to use generalized projection, then I think we have no means to actually subtract an attribute from another, or to actually calculate anything from them, as the definition of projection allow only attribute names, and the definition of selection allow only expressions of the form aθb where a and b are attributes or constants and θ is a binary relational operator (this is logical, in its way, because if we have a relation R(X,Y), then we have no idea about the type of X or Y, making operations on them quite meaningless).
I think generalized projection is a great extension to relational algebra. It's obviously immensely useful in real life, and it can be defended even from a more scientific point of view: if we allow binary conditional operators on the values like "X > 50", then we made assumptions on the type already, rendering that point kind of moot. Your instructor may disagree, though.

If you're looking to do this in the real world, you should be able to do this with a subquery (or a view, which amounts to much the same thing), something like:
select max (diff) from (
select high - low as diff from blah blah blah
)
Whether this applies to the abstract world of relational algebra, I couldn't say. I'm too busy fixing those damn real-world problems :-)

Related

Why does accessing coefficients following estimation with nl require slightly different syntax than for other estimation commands?

Following most estimation commands in Stata (e.g. reg, logit, probit, etc.) one may access the estimates using the _b[ParameterName] syntax (or the synonymous _coef[ParameterName]). For example:
regress y x
followed by
di _b[x]
will display the estimate of the coefficient of x. di _b[_cons] will display the coefficient of the estimated intercept (assuming the regress command was successful), etc.
But if I use the nonlinear least squares command nl I (seemingly) have to do something slightly different. Now (leaving aside that for this example model there is absolutely no need to use a NLLS regression):
nl (y = {_cons} + {x}*x)
followed by (notice the forward slash)
di _b[/x]
will display the estimate of the coefficient of x.
Why does accessing parameter estimates following nl require a different syntax? Are there subtleties to be aware of?
"leaving aside that for this example model there is absolutely no need to use a NLLS regression": I think that's what you can't do here....
The question is about why the syntax is as it is. That's a matter of logic and a matter of history. Why a particular syntax was chosen is ultimately a question for the programmers at StataCorp who chose it. Here is one limited take on your question.
The main syntax for regression-type models grows out of a syntax designed for linear regression models in which by default the parameters include an intercept, as you know.
The original syntax for nonlinear regression models (in the sense of being estimated by nonlinear least-squares) matches a need to estimate a bundle of parameters specified by the user, which need not include an intercept at all.
Otherwise put, there is no question of an intercept being a natural default; no parameterisation is a natural default and each model estimated by nl is sui generis.
A helpful feature is that users can choose the names they find natural for the parameters, within the constraints of what counts as a legal name in Stata, say alpha, beta, gamma, a, b, c, etc. If you choose _cons for the intercept in nl that is a legal name but otherwise not special and just your choice; nl won't take it as a signal that it should flip into using regress conventions.
The syntax you cite is part of what was made possible by a major redesign of nl but it is consistent with the original philosophy.
That the syntax is different because it needs to be may not be the answer you seek, but I guess you'll get a fuller answer only from StataCorp; developers do hang out on Statalist, but they don't make themselves visible here.

constrained regression with many variables

I have around 200 dummies, and wish to run a constrained OLS regression where I impose that the sum of all coefficients on the dummies is equal to 1.
One option is to type:
constraint define 1 dummy_1+dummy_2 +...+dummy_200=1
cnsreg y x_1 x_2 dummy_1-dummy_200, c(1)
...but typing the constraint out would obviously be very painful.
Is there a way to quickly define such a large constraint? The matrix form would be very quick and straightforward, but after much reading online and in Stata guide, it is not clear to me how to do constraints in matrix form, and if they are even possible.
There are at least two sides to this, how to do it and whether it will work in any statistical sense.
How to do it seems easier than you fear as the difficult bit is just inserting "+" signs between the variable names, and that's string manipulation. Something like
unab myvars : dummy_*
local myvars : subinstr local myvars " " "+", all
mac li
constraint 1 `myvars' = 1
should get you started. The macro list is so you can see what you did, especially if it is not what you want.
Whether it will work for you statistically is outside the scope of this forum, but if that's the only constraint note that it's consistent with all kinds of negative and positive coefficients. Perhaps there are special features of your problem that make it a natural constraint, but my intuition is that such a model will be hard to estimate.
I would take a completely different approach. Such constraints typically occur when trying out a different coding scheme for a set of indicator variables. If that is the case then I would use Stata's factor variables, combined with margins with the contrast operators.

most readable way in XPath to write "is value X a member of sequence S"?

XPath 2.0 has some new functions and syntax, relative to 1.0, that work with sequences. Some of theset don't really add to what the language could already do in 1.0 (with node sets), but they make it easier to express the desired logic in ways that are more readable. This increases the chances of the programmer getting the code correct -- and keeping it that way. For example,
empty(s) is equivalent to not(s), but its intent is much clearer when you want to test whether a sequence is empty.
Correction: the effective boolean value of a sequence is in general more complicated than that. E.g. empty((0)) != not((0)). This applies to exists(s) vs. s in a boolean context as well. However, there are domains of s where empty(s) is equivalent to not(s), so the two could be used interchangeably within those domains. But this goes to show that the use of empty() can make a non-trivial difference in making code easier to understand.
Similarly, exists(s) is equivalent to boolean(s) that already existed in XPath 1.0 (or just s in a boolean context), but again is much clearer about the intent.
Quantified expressions; e.g. "some $x in expression satisfies test($x)" would be equivalent to boolean(expression[test(.)]) (although the new syntax is more flexible, in that you don't need to worry about losing the context item because you have the variable to refer to it by).
Similarly, "every $x in expression satisfies test($x)" would be equivalent to not(expression[not(test(.))]) but is more readable.
These functions and syntax were evidently added at no small cost, solely to serve the goal of writing XPath that is easier to map to how humans think. This implies, as experienced developers know, that understandable code is significantly superior to code that is difficult to understand.
Given all that ... what would be a clear and readable way to write an XPath test expression that asks
Does value X occur in sequence S?
Some ways to do it: (Note: I used X and S notation here to indicate the value and the sequence, but I don't mean to imply that these subexpressions are element name tests, nor that they are simple expressions. They could be complicated.)
X = S: This would be one of the most unreadable, since it requires the reader to
think about which of X and S are sequences vs. single values
understand general comparisons, which are not obvious from the syntax
However, one advantage of this form is that it allows us to put the topic (X) before the comment ("is a member of S"), which, I think, helps in readability.
See also CMS's good point about readability, when the syntax or names make the "cardinality" of X and S obvious.
index-of(S, X): This one is clear about what's intended as a value and what as a sequence (if you remember the order of arguments to index-of()). But it expresses more than we need to: it asks for the index, when all we really want to know is whether X occurs in S. This is somewhat misleading to the reader. An experienced developer will figure out what's intended, with some effort and with understanding of the context. But the more we rely on context to understand the intent of each line, the more understanding the code becomes a circular (spiral) and potentially Sisyphean task! Also, since index-of() is designed to return a list of all the indexes of occurrences of X, it could be more expensive than necessary: a smart processor, in order to evaluate X = S, wouldn't necessarily have to find all the contents of S, nor enumerate them in order; but for index-of(S, X), correct order would have to be determined, and all contents of S must be compared to X. One other drawback of using index-of() is that it's limited to using eq for comparison; you can't, for example, use it to ask whether a node is identical to any node in a given sequence.
Correction: This form, used as a conditional test, can result in a runtime error: Effective boolean value is not defined for a sequence of two or more items starting with a numeric value. (But at least we won't get wrong boolean values, since index-of() can't return a zero.) If S can have multiple instances of X, this is another good reason to prefer form 3 or 6.
exists(index-of(X, S)): makes the intent clearer, and would help the processor eliminate the performance penalty if the processor is smart enough.
some $m in S satisfies $m eq X: This one is very clear, and matches our intent exactly. It seems long-winded compared to 1, and that in itself can reduce readability. But maybe that's an acceptable price for clarity. Keep in mind that X and S could potentially be complex expressions themselves -- they're not necessarily just variable references. An advantage is that since the eq operator is explicit, you can replace it with is or any other comparison operator.
S[. eq X]: clearer than 1, but shares the semantic drawbacks of 2: it computes all members of S that are equal to X. Actually, this could return a false negative (incorrect effective boolean value), if X is falsy. E.g. (0, 1)[. eq 0] returns 0 which is falsy, even though 0 occurs in (0, 1).
exists(S[. eq X]): Clearer than 1, 2, 3, and 5. Not as clear as 4, but shorter. Avoids the drawbacks of 5 (or at least most of them, depending on the processor smarts).
I'm kind of leaning toward the last one, at this point: exists(S[. eq X])
What about you... As a developer coming to a complex, unfamiliar XSLT or XQuery or other program that uses XPath 2.0, and wanting to figure out what that program is doing, which would you find easiest to read?
Apologies for the long question. Thanks for reading this far.
Edit: I changed = to eq wherever possible in the above discussion, to make it easier to see where a "value comparison" (as opposed to a general comparison) was intended.
For what it's worth, if names or context make clear that X is a singleton, I'm happy to use your first form, X = S -- for example when I want to check an attribute value against a set of possible values:
<xsl:when test="#type = ('A', 'A+', 'A-', 'B+')" />
or
<xsl:when test="#type = $magic-types"/>
If I think there is a risk of confusion, then I like your sixth formulation. The less frequently I have to remember the rules for calculating an effective boolean value, the less frequently I make a mistake with them.
I prefer this one:
count(distinct-values($seq)) eq count(distinct-values(($x, $seq)))
When $x is itself a sequence, this expression implements the (value-based) subset of relation between two sets of values, that are represented as sequences. This implementation of subset of has just linear time complexity -- vs many other ways of expressing this, that have O(N^2)) time complexity.
To summarize, the question whether a single value belongs to a set of values is a special case of the question whether one set of values is a subset of another. If we have a good implementation of the latter, we can simply use it for answering the former.
The functx library has a nice implementation of this function, so you can use
functx:is-node-in-sequence($X, $Y)
(this particular function can be found at http://www.xqueryfunctions.com/xq/functx_is-node-in-sequence.html)
The whole functx library is available for both XQuery (http://www.xqueryfunctions.com/) and XSLT (http://www.xsltfunctions.com/)
Marklogic ships the functx library with their core product; other vendors may also.
Another possibility, when you want to know whether node X occurs in sequence S, is
exists((X) intersect S)
I think that's pretty readable, and concise. But it only works when X and the values in S are nodes; if you try to ask
exists(('bob') intersect ('alice', 'bob'))
you'll get a runtime error.
In the program I'm working on now, I need to compare strings, so this isn't an option.
As Dimitri notes, the occurrence of a node in a sequence is a question of identity, not of value comparison.

2D PHP Pattern Creator Altgorithm

I'm looking for an algorithm to help me build 2D patterns based on rules. The idea is that I could write a script using a given site of parameters, and it would return a random, 2-dimensional sequence up to a given length.
My plan is to use this to generate image patterns based on rules. Things like image fractals or sprites for game levels could possibly use this.
For example, lets say that you can use A, B, C, & D to create the pattern. The rule is that C and A can never be next to each other, and that D always follows C. Next, lets say I want a pattern of size 4x4. The result might be the following which respects all the rules.
A B C D
B B B B
C D B B
C D C D
Are there any existing libraries that can do calculations like this? Are there any mathematical formulas I can read-up on?
While pretty inefficient concering runtime, backtracking is an often used algorithm for such a problem.
It follows a simple pattern, and if written correctly, you can easily replace a rule set into it.
Define your rule data structures; i.e., define the set of operations that the rules can encapsulate, and define the available cross-referencing that can be done. Once you've done this, you should have a clearer view of what type of algorithms to use to apply these rules to a potential result set.
Supposing that your rules are restricted to "type X is allowed to have type Y immediately to its left/right/top/bottom" you potentially have situations where generating possible patterns is computationally difficult. Take a look at Wang Tiles (a good source is the book Tilings and Patterns by Grunbaum and Shephard) and you'll see that with the states sets of rules you might define sets of Wang Tiles. Appropriate sets of these are Turing Complete.
For small rectangles, or your sets of rules, this may only be of academic interest. As mentioned elsewhere a backtracking approach might be appropriate for your ruleset - in which case you may want to consider appropriate heuristics for the order in which new components are added to your grid. Again, depending on your rulesets, other approaches might work. E.g. if your ruleset admits many solutions you might get a long way by randomly allocating many items to the grid before attempting to fill in remaining gaps.

Relational algebra - what is the proper way to represent a 'having' clause?

This is a small part of a homework question so I can understand the whole.
SQL query to list car prices that occur more than once:
select car_price from cars
group by car_price
having count (car_price) > 1;
The general form of this in relational algebra is
Y (gl, al) R
where Y is the greek symbol, gl is list of attributes to group, and al is a list of aggregations.
The relational algebra:
Y (count(car_price)) cars
How is the having clause written in that statement? Is there a shorthand? If not, do I just need to select from that relation? Like this?
SELECT (count(car_price) > 1) [Y (count(car_price)) cars]
select count(*) from (select * from cars where price > 1) as cars;
also known as relational closure.
For a more or less precise answer to the actual question asked, "Relational algebra - what is the proper way to represent a ‘having’ clause?", it needs to be stated first that the question itself seems to suggest, or presume, that there exists such a thing as "THE" relational algebra, but that presumption is simply untrue !
An algebra is a set of operators, and anyone can define any set of operators he likes, meaning anyone can define any algebra he likes ! In his most recent publication, Hugh Darwen mentions that RESTRICT is not a fundamental operator of the algebra, though lots of others do consider it as such.
Especially with respect to aggregations and summaries, there is little consensus as to how those should be incorporated in a relational algebra. Defining operators such as COUNT() (that take a relation as an argument value and return an integer) as part of the algebra, might be problematic wrt the closure property of the algebra, precisely because such operators do not return a relation ...
So the sorry, but nevertheless most appropriate, answer here seems to be that a conclusive answer to this question is almost impossible to give ...

Resources