Where is DropWhile in Mathematica? - wolfram-mathematica

Mathematica 6 added TakeWhile, which has the syntax:
TakeWhile[list, crit]
gives elements ei from the beginning of list, continuing so long as crit[ei] is True.
There is however no corresponding "DropWhile" function. One can construct DropWhile using LengthWhile and Drop, but it almost seems as though one is discouraged from using DropWhile. Why is this?
To clarify, I am not asking for a way to implement this function. Rather: why is it not already present? It seems to me that there must be a reason for its absence other than an oversight, or it would have been corrected by now. Is there something inefficient, undesirable, or superfluous about DropWhile?
There appears to be some ambiguity about the function of DropWhile, so here is an example:
DropWhile = Drop[#, LengthWhile[#, #2]] &;
DropWhile[{1,2,3,4,5}, # <= 3 &]
Out= {4, 5}

Just a blind guess.
There are a lot list operations that could take a while criteria. For example:
Total..While
Accumulate..While
Mean..While
Map..While
Etc..While
They are not difficult to construct, anyway.
I think those are not included just because the number of "primitive" functions is already growing too long, and the criteria of "is it frequently needed and difficult to implement with good performance by the user?" is prevailing in those cases.

The ubiquitous Lists in Mathematica are fixed length vectors, and when they are of a machine numbers it is a packed array.
Thus the natural functions for a recursively defined linked list (e.g. in Lisp or Haskell) are not the primary tools in Mathematica.
So I am inclined to think this explains why Wolfram did not fill out its repertoire of manipulation functions.

Related

How I predict how some formula will behave with integers?

I am making some software that need to work with integers.
Also I need to apply some formula to those integers, repeatedly over time (example, do x/=z several times in a row for a indefinite amount).
All tools, algorithms and formulas I could think or find, or don't work with integers at all, or work as approximations at best.
For example the x/=z several times in a row for example, you can theoretically calculate what x will be in the 10th time by doing x = x/(z^10), but that will be wrong if the result is fractional, you can use floor(x/(z^10)), but the result will STILL be wrong.
Plotting software that I found also don't have integers at all, or has "floor()/ceil()" functions support, at best, and still the result would fall in the problem of the previous paragraph.
So how I do it?
Here's something to get you going for the iteration of x/=z:
(that should have ended in "all three terms are 0 with regard to integer division")
Now if x or z are negative, you can try and see whether this still holds; I did not invest the time to make the necessary case distinctions, but they should be fairly analogous.
As Karoly Horvath mentions in a comment, without a clear specification of the kinds of functions for which you would like to find a shortcut to replace iterative evaluation, helping you out won't be possible since there are uncountably many functions over the integers, and the same approach won't work for all of them.

most readable way in XPath to write "is value X a member of sequence S"?

XPath 2.0 has some new functions and syntax, relative to 1.0, that work with sequences. Some of theset don't really add to what the language could already do in 1.0 (with node sets), but they make it easier to express the desired logic in ways that are more readable. This increases the chances of the programmer getting the code correct -- and keeping it that way. For example,
empty(s) is equivalent to not(s), but its intent is much clearer when you want to test whether a sequence is empty.
Correction: the effective boolean value of a sequence is in general more complicated than that. E.g. empty((0)) != not((0)). This applies to exists(s) vs. s in a boolean context as well. However, there are domains of s where empty(s) is equivalent to not(s), so the two could be used interchangeably within those domains. But this goes to show that the use of empty() can make a non-trivial difference in making code easier to understand.
Similarly, exists(s) is equivalent to boolean(s) that already existed in XPath 1.0 (or just s in a boolean context), but again is much clearer about the intent.
Quantified expressions; e.g. "some $x in expression satisfies test($x)" would be equivalent to boolean(expression[test(.)]) (although the new syntax is more flexible, in that you don't need to worry about losing the context item because you have the variable to refer to it by).
Similarly, "every $x in expression satisfies test($x)" would be equivalent to not(expression[not(test(.))]) but is more readable.
These functions and syntax were evidently added at no small cost, solely to serve the goal of writing XPath that is easier to map to how humans think. This implies, as experienced developers know, that understandable code is significantly superior to code that is difficult to understand.
Given all that ... what would be a clear and readable way to write an XPath test expression that asks
Does value X occur in sequence S?
Some ways to do it: (Note: I used X and S notation here to indicate the value and the sequence, but I don't mean to imply that these subexpressions are element name tests, nor that they are simple expressions. They could be complicated.)
X = S: This would be one of the most unreadable, since it requires the reader to
think about which of X and S are sequences vs. single values
understand general comparisons, which are not obvious from the syntax
However, one advantage of this form is that it allows us to put the topic (X) before the comment ("is a member of S"), which, I think, helps in readability.
See also CMS's good point about readability, when the syntax or names make the "cardinality" of X and S obvious.
index-of(S, X): This one is clear about what's intended as a value and what as a sequence (if you remember the order of arguments to index-of()). But it expresses more than we need to: it asks for the index, when all we really want to know is whether X occurs in S. This is somewhat misleading to the reader. An experienced developer will figure out what's intended, with some effort and with understanding of the context. But the more we rely on context to understand the intent of each line, the more understanding the code becomes a circular (spiral) and potentially Sisyphean task! Also, since index-of() is designed to return a list of all the indexes of occurrences of X, it could be more expensive than necessary: a smart processor, in order to evaluate X = S, wouldn't necessarily have to find all the contents of S, nor enumerate them in order; but for index-of(S, X), correct order would have to be determined, and all contents of S must be compared to X. One other drawback of using index-of() is that it's limited to using eq for comparison; you can't, for example, use it to ask whether a node is identical to any node in a given sequence.
Correction: This form, used as a conditional test, can result in a runtime error: Effective boolean value is not defined for a sequence of two or more items starting with a numeric value. (But at least we won't get wrong boolean values, since index-of() can't return a zero.) If S can have multiple instances of X, this is another good reason to prefer form 3 or 6.
exists(index-of(X, S)): makes the intent clearer, and would help the processor eliminate the performance penalty if the processor is smart enough.
some $m in S satisfies $m eq X: This one is very clear, and matches our intent exactly. It seems long-winded compared to 1, and that in itself can reduce readability. But maybe that's an acceptable price for clarity. Keep in mind that X and S could potentially be complex expressions themselves -- they're not necessarily just variable references. An advantage is that since the eq operator is explicit, you can replace it with is or any other comparison operator.
S[. eq X]: clearer than 1, but shares the semantic drawbacks of 2: it computes all members of S that are equal to X. Actually, this could return a false negative (incorrect effective boolean value), if X is falsy. E.g. (0, 1)[. eq 0] returns 0 which is falsy, even though 0 occurs in (0, 1).
exists(S[. eq X]): Clearer than 1, 2, 3, and 5. Not as clear as 4, but shorter. Avoids the drawbacks of 5 (or at least most of them, depending on the processor smarts).
I'm kind of leaning toward the last one, at this point: exists(S[. eq X])
What about you... As a developer coming to a complex, unfamiliar XSLT or XQuery or other program that uses XPath 2.0, and wanting to figure out what that program is doing, which would you find easiest to read?
Apologies for the long question. Thanks for reading this far.
Edit: I changed = to eq wherever possible in the above discussion, to make it easier to see where a "value comparison" (as opposed to a general comparison) was intended.
For what it's worth, if names or context make clear that X is a singleton, I'm happy to use your first form, X = S -- for example when I want to check an attribute value against a set of possible values:
<xsl:when test="#type = ('A', 'A+', 'A-', 'B+')" />
or
<xsl:when test="#type = $magic-types"/>
If I think there is a risk of confusion, then I like your sixth formulation. The less frequently I have to remember the rules for calculating an effective boolean value, the less frequently I make a mistake with them.
I prefer this one:
count(distinct-values($seq)) eq count(distinct-values(($x, $seq)))
When $x is itself a sequence, this expression implements the (value-based) subset of relation between two sets of values, that are represented as sequences. This implementation of subset of has just linear time complexity -- vs many other ways of expressing this, that have O(N^2)) time complexity.
To summarize, the question whether a single value belongs to a set of values is a special case of the question whether one set of values is a subset of another. If we have a good implementation of the latter, we can simply use it for answering the former.
The functx library has a nice implementation of this function, so you can use
functx:is-node-in-sequence($X, $Y)
(this particular function can be found at http://www.xqueryfunctions.com/xq/functx_is-node-in-sequence.html)
The whole functx library is available for both XQuery (http://www.xqueryfunctions.com/) and XSLT (http://www.xsltfunctions.com/)
Marklogic ships the functx library with their core product; other vendors may also.
Another possibility, when you want to know whether node X occurs in sequence S, is
exists((X) intersect S)
I think that's pretty readable, and concise. But it only works when X and the values in S are nodes; if you try to ask
exists(('bob') intersect ('alice', 'bob'))
you'll get a runtime error.
In the program I'm working on now, I need to compare strings, so this isn't an option.
As Dimitri notes, the occurrence of a node in a sequence is a question of identity, not of value comparison.

How does addition work in racket/scheme?

For example, if you try (+ 3 4), how is it broken down and calculated in the source, specifically? Does it use recursion with add1?
The implementation of + is actually much more complicated than you might expect, because arithmetic is generic in Racket: it works on integers, rational numbers, complex numbers, and so on. You can even mix and match these kinds of numbers and it'll do the right thing. In the end, it's ultimately going to use arithmetic in C, which is what the runtime system is written in.
If you're curious, you can find more of the guts of the numeric tower here: https://github.com/plt/racket/blob/master/src/racket/src/numarith.c
Other pointers: Bignum arithmetic, the Scheme numeric tower, the Racket reference on numbers.
The + operator is a primitive operation, part of the core language. For efficiency reasons, it wouldn't make much sense to implement it as a recursive procedure.

Mathematica's pattern matching poorly optimized?

I recently inquired about why PatternTest was causing a multitude of needless evaluations: PatternTest not optimized? Leonid replied that it is necessary for what seems to me as a rather questionable method. I can accept that, though I would prefer a more efficient alternative.
I now realize, which I believe Leonid has been saying for some time, that this problem runs much deeper in Mathematica, and I am troubled. I cannot understand why this is not or cannot be better optimized.
Consider this example:
list = RandomReal[9, 20000];
Head /# list; // Timing
MatchQ[list, {x__Integer, y__}] // Timing
{0., Null}
{1.014, False}
Checking the heads of the list is essentially instantaneous, yet checking the pattern takes over a second. Surely Mathematica could recognize that since the first element of the list is not an Integer, the pattern cannot match, and unlike the case with PatternTest I cannot see how there is any mutability in the pattern. What is the explanation for this?
There appears to be some confusion regarding packed arrays, which as far as I can tell have no bearing on this question. Rather, I am concerned with the O(n2) time complexity on all lists, packed or unpacked.
MatchQ unpacks for these kinds of tests. The reason is that no special case for this has been implemented. In principle it could contain anything.
On["Packing"]
MatchQ[list, {x_Integer, y__}] // Timing
MatchQ[list, {x__Integer, y__}] // Timing
Improving this is very tricky - if you break the pattern matcher you have a serious problem.
Edit 1:
It is true that the unpacking is not the cause for the O(n^2) complexity. It does, however, show that for the MatchQ[list, {x__Integer, y__}] part the code goes to another part of the algorithm (which needs the lists to be unpacked). Some other things to note: This complexity arises only if both patterns are __ if either one of them is _ the algorithm has a better complexity.
The algorithm then goes through all n*n potential matches and there seems no early bailout. Presumably because other patters could be constructed that would need this complexity - The issue is that the above pattern forces the matcher to a very general algorithm.
I then was hoping for MatchQ[list, {Shortest[x__Integer], __}] and friends but to no avail.
So, my two cents: either use a different pattern (and have On["Packing"] to see if it goes to the general matcher) or do a pre-check DeveloperPackedArrayQ[expr] && Head[expr[[1]]]===Integer or some such.
#the author of the first answer. As far as I know from reverse-engeneering and reading of available information, it may be due to different ways the patterns are checked. In fact - as they say - a special hash code is used for pattern matching. This hash (basically a FNV-1 round) makes it very easy to check for particular patterns related to the type of expression involved (matter of a few xor operations). The hashing algorithm cycles inside the expression and each subpart is xorred with the output of the previous one. Special xor values are used for each atom expression - machineInts, machineReals, bigNums, Rationals and so on. Hence, for example, _Integer is easy to check because the hash of any integer is formed with integer's xor value, so all we need to do is doing the inverse op and see if matches - i.e. if we get some particular value or something like that (sorry if I'm vague on actual implementation details. It's WIP). For general or uncommon patterns the check may not take advantage of this hash stuff and require something different.
#the OP Head[] simply acts on the internal expression, taking the value of the first pointer of the expression (expressions are implemented as arrays of pointers). So doing it is as easy as copying and printing a string - very very fast. The pattern matching engine is not even called in this case.

Optimizing the computation of a recursive sequence

What is the fastest way in R to compute a recursive sequence defined as
x[1] <- x1
x[n] <- f(x[n-1])
I am assuming that the vector x of proper length is preallocated. Is there a smarter way than just looping?
Variant: extend this to vectors:
x[,1] <- x1
x[,n] <- f(x[,n-1])
Solve the recurrence relationship ;)
In terms of the question of whether this can be fully "vectorized" in any way, I think the answer is probably "no". The fundamental idea behind array programming is that operations apply to an entire set of values at the same time. Similarly for questions of "embarassingly parallel" computation. In this case, since your recursive algorithm depends on each prior state, there would be no way to gain speed from parallel processing: it must be run serially.
That being said, the usual advice for speeding up your program applies. For instance, do as much of the calculation outside of your recursive function as possible. Sort everything. Predefine your array lengths so they don't have to grow during the looping. Etc. See this question for a similar discussion. There is also a pseudocode example in Tim Hesterberg's article on efficient S-Plus Programming.
You could consider writing it in C / C++ / Fortran and use the handy inline package to deal with the compiling, linking and loading for you.
Of course, your function f() may be a real constraint if that one needs to remain an R function. There is a callback-from-C++-to-R example in Rcpp but this requires a bit more work than just using inline.
Well if you need the entire sequence how fast it can be? assuming that the function is O(1), you cannot do better than O(n), and looping through will give you just that.
In general, the syntax x$y <- f(z) will have to reallocate x every time, which would be very slow if x is a large object. But, it turns out that R has some tricks so that the list replacement function [[<- doesn't reallocate the whole list every time. So I think you can reasonably efficiently do:
x[[1]] <- x1
for (m in seq(2, n))
x[[m]] <- f(x[[m-1]])
The only wasteful aspect here is that you have to generate an array of length n-1 for the for loop, which isn't ideal, but it's probably not a giant issue. You could replace it by a while loop if you preferred. The usual vectorization tricks (lapply, etc.) won't work here...
(The double brackets give you a list element, which is what you probably want, rather than a singleton list.)
For more details, see Chambers (2008). Software for Data Analysis. p. 473-474.

Resources