A particular paragraph in Deep Learning - Bengio - probability

This question concerns the chapter on RNNs in the
Deep Learning book by Prof. Bengio.
In section 10.2.2 on page 336 in the last paragraph, the book talks about
"...because the outputs are the result of a softmax, it must be that the input sequence is a sequence of symbols...".
This seems to suggest that the output is treated as a probability distribution over the possible 'bits' and the next input x(t+1) is sampled using this joint probability distribution over the output bits. Is this interpretation correct?

No, the interpretation is not correct (unless my interpretation of your interpretation is incorrect). x is an input, and it is fixed in advance, so x(t+1) does not depend on the predicted value for timestep t.
In that paragraph he discusses a particular case of an RNN, where y(t) is a prediction of x(t + 1), in other words, the network is trying to predict the next symbol given all the previous symbols.
My understanding of the sentence you are referring to is that since y is the result of a softmax, y has a limited range of values it can assume, and therefore x itself has to be limited to the same range of values; hence x has to be a "symbol or bounded integer". Otherwise, if x were, for instance, a double, y could not predict it, since a softmax only assigns probabilities to a discrete set of values.
UPDATE: as a matter of fact, Bengio has a great paper:
http://arxiv.org/abs/1506.03099
in which he actually suggests that on some iterations we use y(t) instead of x(t+1) as the input when predicting y(t+1) during training (which is along the lines of your understanding in your question).
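To make that training trick concrete, here is a rough, hedged sketch of the idea (often called scheduled sampling); rnn_step, sample_from, and teacher_forcing_prob are placeholder names of my own, not from the paper:

import random

def train_step(rnn_step, sample_from, x_seq, teacher_forcing_prob):
    # Sketch only: on each step, feed either the ground-truth next symbol
    # or a symbol sampled from the model's own previous softmax output.
    h = None
    prev_input = x_seq[0]
    predictions = []
    for t in range(1, len(x_seq)):
        y_t, h = rnn_step(prev_input, h)      # y_t is a softmax over symbols, predicting x_seq[t]
        predictions.append(y_t)
        if random.random() < teacher_forcing_prob:
            prev_input = x_seq[t]             # teacher forcing: use the true symbol
        else:
            prev_input = sample_from(y_t)     # scheduled sampling: use the model's own prediction
    return predictions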

Related

Numerical accuracy of an integer solution to an exponential equation

I have an algorithm that relies on integer inputs x, y and s.
For input checking and raising an exception for invalid arguments I have to make sure that there is a natural number n so that x*s^n=y
Or in words: How often do I have to chain-multiply x with s until I arrive at y.
And more importantly: Do I arrive at y exactly?
This problem can be further abstracted by dividing by x:
x*s^n=y => s^n=y/x => s^n=z
With z = y/x. z is not an integer in general, but one can only arrive at y using integer multiplication if y is divisible by x. So this property can easily be tested first; after that, it is guaranteed that z is also an integer, and it comes down to solving s^n=z.
There is already a question related to that.
There are lots of solutions. Some are iterative, and some solve the equation using a logarithm and either truncate, round, or compare with an epsilon. I am particularly interested in the solutions with logarithms. The general idea is:
from math import log

def check(z, s):
    n = log(z) / log(s)
    return n == int(n)
Comparing floating-point numbers for equality does seem pretty sketchy, though. Under normal circumstances I would not count that as a general and exact solution to the problem. Answers that suggest this approach don't mention the precision issue, and answers that compare with an epsilon just pick an apparently arbitrary small number.
I wonder how robust this method (with straight equality) really is, because it seems to work pretty well and I couldn't break it by trial and error. And if it does break down at some point, how small or large would the epsilon have to be?
So basically my question is:
Can the logarithm approach be guaranteed to be exact under specific circumstances? E.g. limited integer input range.
I have thought about this for a long time now, and I think it is possible that this solution is exact and robust, at least under some circumstances. But I don't have a proof for that.
My line of thinking was:
Can I find a combination of x,y,s so that the chain-multiply just barely misses y, which means that n will be very close to an integer but not quite?
The answer is no. Because x, y and s are integers, the multiplication will also be an integer. So if the result just barely misses y, it has to miss by at least 1.
Well, that is how far I've gotten. My theory is that using only integers makes the calculation very precise. I would consider it a method with good numerical stability, and with very specific behaviour regarding stability. So I believe it is possible that this calculation is precise enough to truncate all decimals. It would be amazing if someone could prove or disprove that.
If a guarantee for correctness can be given for a specific value range, I am interested in the general approach, but a fairly applicable range of values would be the positive part of int32 for the integers and double floating point precision.
Testing with an epsilon is also an option, but then the question is how small that epsilon has to be. This is probably related to the "miss by at least 1" logic.
You’re right to be skeptical of floating point. Math libraries typically
don’t provide correctly rounded transcendental functions (The Table
Maker’s Dilemma), so the exact test is suspect. Indeed, it’s not
difficult to find counterexamples (see the Python below).
Since the input z is an integer, however, we can do an error analysis to
determine an appropriate epsilon. Using calculus, one can prove a bound
log(z+1) − log(z) = log(1 + 1/z) ≥ 1/z − 1/(2z^2).
If log(z)/log(s) is not an integer, then z must be at least one away
from a power of s, putting this bound in play. If 2 ≤ z, s < 2^31
(i.e., they have representations as signed 32-bit integers), then
log(z)/log(s) is at least (1/2^31 − 1/2^63)/log(2^31) away from an
integer. An epsilon of
1.0e-12 is comfortably less than this, yet large enough that if we lose
a couple of ulps (1 ulp is on the order of 3.6e-15 in the worst case
here) to rounding, we don’t get a false negative, even with a rather
poor quality implementation of log.
import math
import random

# Search for a counterexample: log(x**2)/log(x) should be exactly 2,
# but rounding in math.log can make the quotient miss 2.
while True:
    x = random.randrange(2, 2**15)
    if math.log(x**2) / math.log(x) != 2:
        print("#", x)
        break
# 19143
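For reference, here is a hedged sketch of the epsilon-based version suggested by this analysis (the helper name is_power and its exact signature are my own, not from the question):

import math

def is_power(z, s, eps=1.0e-12):
    # Returns True if z == s**n for some natural number n.
    # Assumes 2 <= z, s < 2**31, so by the error analysis above a
    # non-integer log(z)/log(s) lies farther than eps from any integer.
    n = math.log(z) / math.log(s)
    return abs(n - round(n)) < eps

print(is_power(19143**2, 19143))      # True; the exact-equality test can misclassify such cases
print(is_power(19143**2 + 1, 19143))  # False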

Discrepancy between diagram and equations of GRU?

While I was reading Colah's blog, in the diagram we can clearly see that z_t is going to ~h_t and not r_t, but the equations say otherwise. Isn't this supposed to be z_t * h_{t-1} and not r_t * h_{t-1}?
Please correct me if I'm wrong.
I see this is somewhat old; however, if you still haven't figured it out and care, or for any other person who ends up here, the answer is that the figure and the equations are consistent. Note that the operator (x) in the diagram (the pink circle with an X in it) is the Hadamard product, which is an element-wise multiplication between two tensors of the same size. In the equations, this operator is written as * (it is also often represented by a circle with a dot at its center). ~h_t is the output of the tanh operator. The tanh operator receives a linear combination of the input at time t, x_t, and the result of the Hadamard product between r_t and h_{t-1}. Note that r_t should have already been updated by passing the linear combination of x_t and h_{t-1} through a sigmoid. I hope the rest is clear.
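To make the correspondence concrete, here is a minimal, hedged sketch of a single GRU step in NumPy (the weight names W_z, U_z, etc. are my own and biases are omitted); note that the candidate ~h_t uses r_t * h_{t-1}, not z_t * h_{t-1}:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    # One GRU step following the standard equations (biases omitted).
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)               # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)               # reset gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev))   # candidate: Hadamard product r_t * h_{t-1}
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde            # z_t blends the old state and the candidate
    return h_t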

What does 1. mean in a Mathematica solution (of a sum)

I'm trying to evaluate a difficult sum. Mathematica seems to evaluate it, giving the message "Solve was unable to solve the system with inexact coefficients. The answer was obtained by solving a corresponding exact system and numericizing the result."
The solution contains expressions such as (0.5 + 1. I).
What does the 1. mean?
You can look at a similar question here. Mathematica interprets the input 0.5 (or any input containing 0.5) as "numerical", and so its attempts to solve it will be numerical in nature, treating 0.5 as some real number that, to within the relevant precision, looks equal to 0.5. Even though 0.5 == 1/2 returns True, Mathematica still treats those two expressions very differently.
If you input commands using "numerical" (i.e. decimal) numbers, Mathematica falls back to numerical methods (like NIntegrate, NSolve, NDSolve, numerical versions of arithmetic operations, etc.) rather than those that apply to integers, rationals, etc.
The message you see is due to how NSolve (or another such algorithm) works: it takes the step of making the equations exact (it does know, after all, that 0.5 == 1/2), gets an exact solution, and then "numericizes" the result (hits it with an N command) to give you the numerical equivalent.
Type in N[1/2 + I] and see what you get. It should be 0.5 + 1. I. All this means is that you have a quantity that is roughly 1.0000000000000000 in the imaginary direction and 0.5000000000000000 in the real direction.
To see the difference explicitly, try:
Head[1]
Head[1.]
The decimal point indicates to Mathematica that the second of the two is a "real" number, i.e. for floating point arithmetic of some sort. The first one is an integer, for which Mathematica sometimes uses different sorts of algorithms.
The "1." is there to guarantee that subsequent use of that expression doesn't lose that the expression was obtained numerically, and is therefore subject to numerical precision. For example,
In[121]:= Pi/3.14`2 * x
Out[121]= 1.0 x
Even though you might think that 1.0*x == x, it's certainly not true that Pi==3.14; rather, Pi is only 3.14 to the given precision of 2. By including the 1.0 in the answer (which InputForm shows is actually internally 1.00050721451904243263141509021640261145`2) the next evaluation,
In[122]:= % /. x -> 3
Out[122]= 3.0
comes out correct instead of incorrectly giving an exact 3.
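As a rough analogue in Python (not Mathematica; just to illustrate the same exact-versus-numerical distinction): an exact rational compares equal to a float, yet the two behave differently under further arithmetic.

from fractions import Fraction

exact = Fraction(1, 2)   # analogous to Mathematica's exact 1/2
numeric = 0.5            # analogous to the "numerical" 0.5

print(exact == numeric)  # True, just as 0.5 == 1/2 returns True in Mathematica

# Downstream arithmetic differs: exact rationals stay exact,
# while floating point accumulates rounding error.
print(sum([Fraction(1, 10)] * 10) == 1)  # True
print(sum([0.1] * 10) == 1.0)            # False (the sum is 0.9999999999999999)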

Why isn't the prior state vector in the forward-backward algorithm the eigenvector of the transition matrix that has an eigenvalue of 1?

Wikipedia says you have no knowledge of what the first state is, so you have to assign each state equal probability in the prior state vector. But you do know what the transition probability matrix is, and the eigenvector of that matrix with eigenvalue 1 gives the frequency of each state in the HMM (I think), so why don't you use that vector for the prior state vector instead?
This is really a modelling decision. Your suggestion is certainly possible, because it pretty much corresponds to prefixing the observations with a long stretch of time steps in which the hidden states are not observed at all or have no effect; this gives whatever the initial state distribution is time to settle down to the equilibrium distribution.
But if you have a stretch of observations with a delimited start, such as a segment of speech that starts when the speaker starts, or a segment of text that starts at the beginning of a sentence, there is no particular reason to believe that the distribution of the very first state is the same as the equilibrium distribution: I doubt very much if 'e' is the most common character at the start of a sentence, whereas it is well known to be the most common character in English text.
It may not matter very much what you choose, unless you have a lot of very short sequences of observations that you are processing together. Most of the time I would only worry if you wanted to set one of the state probabilities to zero, because the EM algorithm or Baum-Welch algorithm often used to optimise HMM parameters can be reluctant to re-estimate parameters away from zero.
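To illustrate the suggestion in the question, here is a small hedged sketch (my own, using an assumed 2-state transition matrix) that computes the equilibrium distribution, i.e. the left eigenvector with eigenvalue 1, which could serve as the prior state vector:

import numpy as np

# Row-stochastic transition matrix: A[i, j] = P(next state j | current state i).
A = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# The equilibrium distribution pi satisfies pi @ A = pi, i.e. pi is a left
# eigenvector of A with eigenvalue 1 (equivalently, an eigenvector of A.T).
eigvals, eigvecs = np.linalg.eig(A.T)
pi = np.real(eigvecs[:, np.argmax(np.isclose(eigvals, 1.0))])
pi = pi / pi.sum()          # normalize to a probability vector

print(pi)  # roughly [0.833, 0.167], versus the uniform prior [0.5, 0.5]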

Create a function for given input and output

Imagine there are two same-sized sets of numbers.
Is it possible, and how, to create a function, an algorithm, or a subroutine which exactly maps input items to output items? Like:
Input = 1, 2, 3, 4
Output = 2, 3, 4, 5
and the function would be:
def f(x): return x + 1
And by "function" I mean something slightly more complex than [1]:
def f(x):
    if x == 1: return 2
    if x == 2: return 3
    if x == 3: return 4
    if x == 4: return 5
This would be useful for creating special hash functions or function approximations.
Update:
What I am trying to ask is whether there is a way to compress the trivial mapping example from above [1].
Finding the shortest program that outputs some string (sequence, function etc.) is equivalent to finding its Kolmogorov complexity, which is undecidable.
If "impossible" is not a satisfying answer, you have to restrict your problem. In all appropriately restricted cases (polynomials, rational functions, linear recurrences) finding an optimal algorithm will be easy as long as you understand what you're doing. Examples:
polynomial - Lagrange interpolation
rational function - Pade approximation
boolean formula - Karnaugh map
approximate solution - regression, linear case: linear regression
general packing of data - data compression; some techniques, like run-length encoding, are lossless, some not.
In the case of polynomial sequences, it often helps to consider the difference sequence b_n = a_{n+1} - a_n; this reduces a quadratic relation to a linear one, and a linear one to a constant sequence, and so on. But there's no silver bullet. You might build some heuristics (e.g. Mathematica has FindSequenceFunction - check that page to get an impression of how complex this can get) using genetic algorithms, random guesses, checking many built-in sequences and their compositions, and so on. No matter what, any such program - in theory - is infinitely far from perfection due to the undecidability of Kolmogorov complexity. In practice, you might get satisfactory results, but this requires a lot of man-years.
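As a small hedged illustration of that difference trick (my own sketch), applied to the question's example output sequence:

def differences(seq):
    # b[n] = a[n+1] - a[n]
    return [b - a for a, b in zip(seq, seq[1:])]

output = [2, 3, 4, 5]
print(differences(output))               # [1, 1, 1] -> constant, so the rule is linear
print(differences(differences(output)))  # [0, 0]    -> confirms degree 1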
See also another SO question. You might also implement some wrapper to OEIS in your application.
Fields:
Mostly, the limits of what can be done are described in
complexity theory - describing what problems can be solved "fast", like finding the shortest path in a graph, and what cannot, like playing a generalized version of checkers (it is EXPTIME-complete).
information theory - describing how much "information" is carried by a random variable. For example, take coin tossing. Normally, it takes 1 bit to encode the result, and n bits to encode n results (as a long 0-1 sequence). Suppose now that you have a biased coin that gives tails 90% of the time. Then it is possible to find another way of describing n results that on average gives a much shorter sequence. The number of bits per toss needed for optimal coding (less than 1 in this case!) is called entropy; the plot in that article shows how much information is carried (1 bit for a 1/2-1/2 coin, less than 1 for a biased coin, 0 bits if the coin always lands on the same side). A small sketch of this computation appears after this list.
algorithmic information theory - that attempts to join complexity theory and information theory. Kolmogorov complexity belongs here. You may consider a string "random" if it has large Kolmogorov complexity: aaaaaaaaaaaa is not a random string, f8a34olx probably is. So a random string is incompressible (Volchan's "What Is a Random Sequence?" is a very readable introduction). Chaitin's algorithmic information theory book is available for download. Quote: "[...] we construct an equation involving only whole numbers and addition, multiplication and exponentiation, with the property that if one varies a parameter and asks whether the number of solutions is finite or infinite, the answer to this question is indistinguishable from the result of independent tosses of a fair coin." (in other words, no algorithm can guess that result with probability > 1/2). I haven't read that book, however, so I can't rate it.
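Here is that small entropy sketch (my own illustration, using the usual Shannon entropy formula in bits):

import math

def coin_entropy(p):
    # Shannon entropy in bits per toss for a coin with P(tails) = p; assumes 0 < p < 1.
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(coin_entropy(0.5))  # 1.0 bit per toss for a fair coin
print(coin_entropy(0.9))  # about 0.469 bits per toss for the 90/10 coin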
Strongly related to information theory is coding theory, which describes error-correcting codes. Example result: it is possible to encode 4 bits into 7 bits such that it is possible to detect and correct any single error, or to detect two errors (Hamming(7,4)).
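As a hedged sketch of that Hamming(7,4) result (my own matrices in standard form, arithmetic mod 2; it demonstrates single-error correction only):

import numpy as np

# Hamming(7,4) in standard form: G = [I_4 | P], H = [P^T | I_3].
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])

def encode(data4):
    # 4 data bits -> 7-bit codeword.
    return (np.array(data4) @ G) % 2

def correct(received7):
    # Correct a single flipped bit (assumes at most one error occurred).
    r = np.array(received7) % 2
    syndrome = (H @ r) % 2
    if syndrome.any():
        # A nonzero syndrome equals the column of H at the error position.
        for i in range(7):
            if np.array_equal(H[:, i], syndrome):
                r[i] ^= 1
                break
    return r

code = encode([1, 0, 1, 1])
corrupted = code.copy()
corrupted[2] ^= 1                                 # flip one bit
print(np.array_equal(correct(corrupted), code))   # True: the single error was fixed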
The "positive" side are:
symbolic algorithms for Lagrange interpolation and Pade approximation are a part of computer algebra/symbolic computation; von zur Gathen, Gerhard "Modern Computer Algebra" is a good reference.
data compression - here you'd better ask someone else for references :)
Ok, I don't understand your question, but I'm going to give it a shot.
If you only have 2 sets of numbers and you want to find f where y = f(x), then you can try curve-fitting to give you an approximate "map".
In this case, it's linear so curve-fitting would work. You could try different models to see which works best and choose based on minimizing an error metric.
Is this what you had in mind?
Here's another link to curve-fitting.
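As a hedged sketch of that curve-fitting idea (my own code, not from the linked article), a degree-1 least-squares fit to the question's example recovers f(x) = x + 1:

import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([2, 3, 4, 5])

# Fit y = a*x + b by least squares.
a, b = np.polyfit(x, y, 1)
print(a, b)  # approximately 1.0 and 1.0, i.e. f(x) = x + 1

A higher-degree polynomial would also fit these four points exactly, which is why it helps to compare candidate models with an error metric, as suggested above.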
It seems to me that you want a hashtable. These are based on hash functions, and there are known hash functions that work better than others depending on the expected input and desired output.
If what you want is an algorithmic way of mapping arbitrary input to arbitrary output, this is not feasible in the general case, as it totally depends on the input and output sets.
For example, in the trivial sample you have there, the function is immediately obvious: f(x) = x + 1. In other cases it may be very hard or even impossible to generate an exact function describing the mapping; you would have to approximate or just use a map directly.
In some cases (such as your example), linear regression or similar statistical models could find the relation between your input and output sets.
Doing this in the general case is arbitrarily difficult. For example, consider a block cipher used in ECB mode: it maps an input integer to an output integer, but - by design - deriving any general mapping from specific examples is infeasible. In fact, for a good cipher, even with the complete set of mappings between input and output blocks, you still couldn't determine how to calculate that mapping on a general basis.
Obviously, a cipher is an extreme example, but it serves to illustrate that there's no (known) general procedure for doing what you ask.
Discerning an underlying map from input and output data is exactly what Neural Nets are about! You have unknowingly stumbled across a great branch of research in computer science.
