How do binary connections eliminate multiplication?

I was reading Neural Network with Few Multiplications and I'm having trouble understanding how Binary or Ternary Connect eliminates the need for multiplication.
They explain that by stochastically sampling the weights from {-1, 0, 1}, we eliminate the need to multiply and Wx can be calculated using only sign changes. However, even with weights strictly -1, 0, and 1, how can I change the signs of x without multiplication?
E.g. W = [0, 1, -1] and x = [0.3, 0.2, 0.4]. Wouldn't I still need to multiply W and x to get [0, 0.2, -0.4]? Or is there some other way to change the sign more efficiently than multiplication?

Yes. All the general-purpose processors I know of since the "early days" (say, 1970) have a machine operation to take the magnitude of one number, the sign of another, and return the result. The data transfer happens in parallel: the arithmetic part of the operation is a single machine cycle.
Many high-level languages have this capability as a built-in function. It often comes under a name such as "copy_sign".
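As a minimal illustration (a Python sketch of the idea, not the paper's actual implementation; ternary_dot is just an illustrative name), the questioner's W and x can be combined with additions, subtractions, and skips only; the subtraction plays the role of the hardware sign-change described above:

def ternary_dot(W, x):
    # Accumulate W.x for weights restricted to {-1, 0, 1}:
    # no multiplications, only adds, subtracts, and skips.
    total = 0.0
    for w, xi in zip(W, x):
        if w == 1:
            total += xi      # keep the sign of x_i
        elif w == -1:
            total -= xi      # flip the sign of x_i
        # w == 0 contributes nothing
    return total

print(ternary_dot([0, 1, -1], [0.3, 0.2, 0.4]))   # 0*0.3 + 0.2 - 0.4 = -0.2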

Related

How to find the best combination of parameters from a very large set?

I have some processing logic which has 11 parameters (let's say parameter A through parameter K), and different combinations of these parameters can result in different outcomes.
Processing Logic Example:
if x > A:
    x = B
else:
    x = C
y = math.sin(2 * x * x + 1.1416) - D
# other logic involving parameters E, F, G, H, I, J, K
return outcome
Here are some examples of the possible values of the parameters (the others are similar and discrete):
A ∈ [0.01, 0.02, 0.03, ..., 0.2]
E ∈ [1, 2, 3, 4, ..., 200]
I would like to find the combination of these parameters that results in the best outcome.
However, the problem I am facing is that there are in total
10^19 possible combinations, while each combination takes 700 ms of processing time per CPU core. Obviously, the time to process all combinations is unacceptable even if I have a large computing cluster.
Could anyone give some advice on what is the correct methodology to handle this problem?
Here is some of my thoughts:
Step 1. Coarsen the step interval of each parameter so that the total processing time drops to an acceptable level, for example:
A ∈ [0.01, 0.05, 0.09, ..., 0.2]
E ∈ [1, 5, 10, 15, ..., 200]
Step 2. Starting from the best combination found in step 1, do a finer search around that combination to find the best one.
But I am afraid that the best combination might hide somewhere the coarse grid of step 1 cannot see, in which case step 2 would be in vain.
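A rough Python sketch of this two-step idea, with made-up names (evaluate stands in for the 700 ms processing logic, and the grids are only illustrative):

import itertools

def grid_search(evaluate, grids):
    # Exhaustively score every combination in the Cartesian product of the
    # per-parameter value lists and return the best one.
    return max(itertools.product(*grids), key=evaluate)

# Step 1: run grid_search with coarse grids, e.g. A in [0.01, 0.05, ..., 0.2].
# Step 2: run grid_search again with finer grids centred on the step-1 winner.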
This is an optimization problem. However, you have two distinct problems in what you posed:
There are no restrictions or properties on the evaluation function;
You accept only the best solution of 10^19 possibilities.
The field of optimization serves up many possibilities, most of which are one variation or another of hill-climbing search combined with occasional disruptive moves (to help break out of a local maximum that is not the global solution). All of these depend on some manner of continuity or predictability in the evaluation function's dependence on its inputs.
Without that continuity, there is no shorter path to the sole optimal solution.
If you do have some predictability, then you have some reading to do on various solution methods. Start with Newton-Raphson, move on to Gradient Descent, and continue to other topics, depending on the fabric of your function.
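To make the hill-climbing idea concrete, here is a hedged Python sketch over discrete per-parameter grids, with random restarts standing in for the disruptive moves (evaluate and the grids are placeholders for the questioner's own logic, not anything from the question):

import random

def hill_climb(evaluate, grids, restarts=20):
    # grids: one list of allowed values per parameter, in increasing order.
    best_point, best_score = None, float("-inf")
    for _ in range(restarts):
        point = [random.choice(g) for g in grids]          # random start
        score = evaluate(point)
        improved = True
        while improved:
            improved = False
            # Try moving each parameter one grid step up or down.
            for i, g in enumerate(grids):
                idx = g.index(point[i])
                for j in (idx - 1, idx + 1):
                    if 0 <= j < len(g):
                        candidate = point[:i] + [g[j]] + point[i + 1:]
                        s = evaluate(candidate)
                        if s > score:
                            point, score, improved = candidate, s, True
        if score > best_score:
            best_point, best_score = point, score
    return best_point, best_score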
Have you thought about a purely mathematical approach, i.e. trying to find local/global extrema, or exploiting whether the function is monotonic per operation?
There are quite decent numerical methods for derivatives/integrals, even ones that can be used in a relatively generic manner.
So, in other words, limit the scope instead of computing every single option; it depends on the general character of the operations you have in mind.

Force Mathematica to do numerical computations with finite precision

I would like to analyse the numerical stability of analytic expressions in Mathematica. To this end I want to force Mathematica to evaluate the expression numerically at finite precision and compare to a result at much higher precision. The problem is that I cannot really get it to forget about the extra digits it keeps in the background, even when I tell it to do so explicitly. Where is the bug in the following?
In[466]:= Sin[2.0]
Out[466]= 0.9092974268256817
In[467]:= Block[{$MaxExtraPrecision = 0}, N[Sin[2.0], 2]]
Out[467]= 0.9092974268256817
In[468]:= Block[{$MaxExtraPrecision = 0}, N[Sin[2.0`2], 2]]
Out[468]= 0.91
In[469]:= SetPrecision[%, 16]
Out[469]= 0.9092974268256817
Even in the third version it keeps many more digits in the background.
Maybe NumberForm is what you need.
NumberForm[expr, n] prints with approximate real numbers in expr
given to n-digit precision.
http://reference.wolfram.com/mathematica/ref/NumberForm.html

In what situations does the difference between random numbers generated on [0,1) and those generated on [0,1] actually matter?

I'm used to pseudo random number generators that return floating point values in the half open interval [0,1).
I've seen some reference to RNGs that can return values on the closed interval [0,1], e.g. this implementation of the Mersenne Twister.
I can see reasons why you'd want to exclude one, or both, of the endpoints for mathematical reasons, e.g.
exponentially_distributed=-logf( 1.0-rng() )
always yields a valid number if 0.0<=rng()<1.0.
But I can't think of a case where replacing an rng yielding [0,1] with one that yields [0,1) would produce any practical difference.
In what situations is a floating point pseudo random number generator that returns values on the closed interval [0,1] absolutely necessary?
Maybe if you're randomly generating the probability of an event occurring? If you allow 0, you have to allow 1.
I can't figure out when the closed interval would be useful, but the half-open interval seems the only reasonable way to go.
Let's take coin tossing:
If you say rnd() < 0.5 is heads and the rest is tails, you will get more tails than heads if you use the closed interval. How many more tails depends on how likely it is to actually get 1.
A compelling reason to use a half-open interval is the use case where you are picking a random array index. When you scale a value from [0, 1) by arrayLength and truncate to an integer, you can never get the value arrayLength itself, which is not a valid index into the array in most languages (in Java, for example, it would throw an ArrayIndexOutOfBoundsException). The half-open interval is a great convenience here.
A reason for having a closed interval [0, 1] is Albin's probability argument. But it's worth noting that mathematically speaking, the probability of picking any particular random number, including 1, in [0, 1], is zero. For pseudo random number generators, though, it will pop up occasionally.
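A small Python illustration of that array-index use case (random.random() returns values on the half-open interval [0, 1)):

import random

arr = ["a", "b", "c", "d"]
u = random.random()           # 0.0 <= u < 1.0
i = int(u * len(arr))         # always a valid index, 0..3
print(arr[i])

# If the generator could return exactly 1.0 (closed interval [0, 1]),
# int(1.0 * len(arr)) would be 4 and arr[4] would raise an IndexError.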

Biasing random number generator to some integer n with deviation b

Given an integer range R = [a, b] (where a >=0 and b <= 100), a bias integer n in R, and some deviation b, what formula can I use to skew a random number generator towards n?
So for example if I had the numbers 1 through 10 inclusively and I don't specify a bias number, then I should in theory have equal chances of randomly drawing one of them.
But if I do give a specific bias number (say, 3), then the number generator should draw 3 more frequently than the other numbers.
And if I specify a deviation of say 2 in addition to the bias number, then the number generator should draw 1 through 5 more frequently than 6 through 10.
What algorithm can I use to achieve this?
I'm using Ruby if it makes it any easier/harder.
I think the simplest route is to sample from a normal (aka Gaussian) distribution with the properties you want, and then transform the result:
generate a normal value with the given mean and sd
round to the nearest integer
if it is outside the given range (a normal can generate values over the entire range from -infinity to +infinity), discard and repeat
If you need to generate a normal from a uniform, the simplest transform is "Box-Muller".
There are some details you may need to worry about. In particular, Box-Muller is limited in range (it doesn't generate extremely unlikely values, ever), so if you give a very narrow range then you will never get the full range of values. Other transforms are not as limited - I'd suggest using whatever Ruby provides (look for "normal" or "gaussian").
Also, be careful to round the value. 2.6 to 3.4 should all become 3, for example. If you simply discard the decimal (so 3.0 to 3.999 become 3) you will be biased.
If you're really concerned with efficiency and don't want to discard values, you can simply invent something. One way to cheat is to mix a uniform variate with the bias value (so 9/10 times generate the uniform, 1/10 times return 3, say). In some cases, where you only care about the average of the sample, that can be sufficient.
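A hedged Python sketch of this normal-round-reject approach (the question itself is about Ruby; random.gauss is Python's standard-library normal sampler, and the numbers below are only an example):

import random

def biased_int(a, b, bias, deviation):
    # Sample a normal centred on the bias value, round to the nearest
    # integer, and resample whenever the result falls outside [a, b].
    while True:
        x = int(round(random.gauss(bias, deviation)))
        if a <= x <= b:
            return x

print([biased_int(1, 10, 3, 2) for _ in range(20)])   # values near 3 dominate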
For the first part, "But if I do give a specific bias number (say, 3), then the number generator should draw 3 more frequently than the other numbers", a very easy solution:
import random

def randBias(a, b, biasedNum=None, bias=0):
    # Widen the range by `bias`; any draw above b is mapped to the biased number.
    x = random.randint(a, b + bias)
    if x <= b:
        return x
    else:
        return biasedNum
For the second part, I would say it depends on the task. In a case where you need to generate a billion random numbers from the same distribution, I would calculate the probability of each number explicitly and use a weighted random number generator (see Random weighted choice).
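As a small illustration of explicit weights (a Python sketch; random.choices is the standard-library weighted sampler available since Python 3.6, and the weights here are arbitrary):

import random

values = list(range(1, 11))
weights = [3 if v == 3 else 1 for v in values]    # make 3 three times as likely
print(random.choices(values, weights=weights, k=20))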
If you want a unimodal distribution (where the bias is concentrated in one particular value of your range of numbers, for example 3 as you state), then the answer provided by andrew cooke is good, mostly because it allows you to fine-tune the deviation very accurately.
If however you wish to introduce several biases - for instance you want a trimodal distribution, with the numbers a, (a+b)/2 and b more frequent than others - then you would do well to implement weighted random selection.
A simple algorithm for this was given in a recent question on StackOverflow; its complexity is linear. Using such an algorithm, you would simply maintain a list, initially containing {a, a+1, a+2, ..., b-1, b} (so of size b-a+1), and when you want to add a bias towards X, you would add several copies of X to the list, depending on how much you want to bias. Then you pick a random item from the list.
If you want something more efficient, the most efficient method is called the "Alias method", which was implemented very clearly in Python by Denis Bzowy; once your array has been preprocessed, it runs in constant time (but that means that you can't update the biases any more once you've done the preprocessing - or you would have to reprocess the table).
The downside of both techniques is that, unlike with the Gaussian distribution, biasing towards X will not also bias somewhat towards X-1 and X+1. To simulate this effect you would have to do something such as
def addBias(x, L):
    # Add extra copies of x (and fewer copies of its neighbours) so that
    # values near x come up more often when drawing uniformly from L.
    L = L + [x] * 5
    L = L + [x + 2]
    L = L + [x + 1] * 2
    L = L + [x - 1] * 3
    L = L + [x - 2]
    return L
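For example, a hypothetical use of the sketch above, biasing the range 1 through 10 towards 3:

import random

L = addBias(3, list(range(1, 11)))
print(random.choice(L))   # 3 and its immediate neighbours come up most often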

Is there a fast way to invert a matrix in Matlab?

I have lots of large (around 5000 x 5000) matrices that I need to invert in Matlab. I actually need the inverse, so I can't use mldivide instead, which is a lot faster for solving Ax=b for just one b.
My matrices come from a problem that gives them some nice properties. First off, their determinant is 1, so they're definitely invertible. They aren't diagonalizable, though, or I would try to diagonalize them, invert them, and then put them back together. Their entries are all real numbers (actually rational).
I'm using Matlab for getting these matrices and for the things I need to do with their inverses, so I would prefer a way to speed Matlab up. But if there is another language I can use that'll be faster, then please let me know. I don't know a lot of other languages (a little bit of C and a little bit of Java), so if it's really complicated in some other language, then I might not be able to use it. Please go ahead and suggest it, though, in case.
I actually need the inverse, so I can't use mldivide instead,...
That's not true, because you can still use mldivide to get the inverse. Note that A^-1 = A^-1 * I. In MATLAB, this is equivalent to
invA = A\speye(size(A));
On my machine, this takes about 10.5 seconds for a 5000x5000 matrix. Note that MATLAB does have an inv function to compute the inverse of a matrix. Although this will take about the same amount of time, it is less efficient in terms of numerical accuracy (more info in the link).
First off, their determinant is 1 so they're definitely invertible
Rather than det(A) = 1, it is the condition number of your matrix that dictates how accurate or stable the inverse will be. Note that det(A) = ∏_{i=1}^{n} λ_i. So just setting λ_1 = M, λ_n = 1/M and λ_i = 1 for i ≠ 1, n will give you det(A) = 1. However, as M → ∞, cond(A) = M^2 → ∞ and λ_n → 0, meaning your matrix is approaching singularity and there will be large numerical errors in computing the inverse.
My matrices are coming from a problem that means they have some nice properties.
Of course, there are other more efficient algorithms that can be employed if your matrix is sparse or has other favorable properties. But without any additional info on your specific problem, there is nothing more that can be said.
I would prefer a way to speed Matlab up
MATLAB uses Gauss elimination to compute the inverse of a general matrix (full rank, non-sparse, without any special properties) using mldivide, and this is Θ(n^3), where n is the size of the matrix. So, in your case, n = 5000 and there are 1.25 x 10^11 floating point operations. So on a reasonable machine with about 10 Gflops of computational power, you're going to require at least 12.5 seconds to compute the inverse, and there is no way out of this unless you exploit the "special properties" (if they're exploitable).
Inverting an arbitrary 5000 x 5000 matrix is not computationally easy no matter what language you are using. I would recommend looking into approximations. If your matrices are low rank, you might want to try a low-rank approximation M = USV'
Here are some more ideas from math-overflow:
https://mathoverflow.net/search?q=matrix+inversion+approximation
First suppose the eigenvalues are all 1. Let A be the Jordan canonical form of your matrix. Then you can compute A^{-1} using only matrix multiplication and addition by
A^{-1} = I + (I-A) + (I-A)^2 + ... + (I-A)^k
where k < dim(A). Why does this work? Because generating functions are awesome. Recall the expansion
(1-x)^{-1} = 1/(1-x) = 1 + x + x^2 + ...
This means that we can invert (1-x) using an infinite sum. You want to invert a matrix A, so you want to take
A = I - X
Solving for X gives X = I-A. Therefore by substitution, we have
A^{-1} = (I - (I-A))^{-1} = I + (I-A) + (I-A)^2 + ...
Here I've just used the identity matrix I in place of the number 1. Now we have the problem of convergence to deal with, but this isn't actually a problem. By the assumption that A is in Jordan form and has all eigenvalues equal to 1, we know that A is upper triangular with all 1s on the diagonal. Therefore I-A is upper triangular with all 0s on the diagonal. Therefore all eigenvalues of I-A are 0, so its characteristic polynomial is x^dim(A) and its minimal polynomial is x^{k+1} for some k < dim(A). Since a matrix satisfies its minimal (and characteristic) polynomial, this means that (I-A)^{k+1} = 0. Therefore the above series is finite, with the largest nonzero term being (I-A)^k. So it converges.
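A small NumPy sketch of this finite series for a unit upper-triangular matrix (illustrative only; the question itself is about MATLAB, and the 5x5 random matrix here is just an example):

import numpy as np

n = 5
A = np.eye(n) + np.triu(np.random.rand(n, n), k=1)   # ones on the diagonal
N = np.eye(n) - A                  # strictly upper triangular, so nilpotent
inv_A = np.eye(n)
term = np.eye(n)
for _ in range(n - 1):             # (I - A)^k = 0 for k >= n
    term = term @ N
    inv_A += term
print(np.allclose(inv_A @ A, np.eye(n)))              # True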
Now, for the general case, put your matrix into Jordan form, so that you have a block triangular matrix, e.g.:
A 0 0
0 B 0
0 0 C
Where each block has a single value along the diagonal. If that value is a for block A, then use the above trick to invert (1/a) * A, and then scale the result by 1/a to recover A's inverse. Since the full matrix is block triangular the inverse will be
A^{-1} 0 0
0 B^{-1} 0
0 0 C^{-1}
There is nothing special about having three blocks, so this works no matter how many you have.
Note that this trick works whenever you have a matrix in Jordan form. The computation of the inverse in this case will be very fast in Matlab because it only involves matrix multiplication, and you can even use tricks to speed that up since you only need powers of a single matrix. This may not help you, though, if it's really costly to get the matrix into Jordan form.
