PyTorch - shape of nn.Linear weights - matrix

Yesterday I came across this question and for the first time noticed that the weights of the linear layer nn.Linear need to be transposed before applying matmul.
Code for applying the weights:
output = input.matmul(weight.t())
What is the reason for this?
Why are the weights not in the transposed shape just from the beginning, so they don't need to be transposed every time before applying the layer?

I found an answer here:
Efficient forward pass in nn.Linear #2159
It seems there is no deep technical reason behind this layout. However, the transpose operation doesn't seem to slow down the computation.
According to the issue mentioned above, during the forward pass the transpose operation is (almost) free in terms of computation, whereas during the backward pass leaving out the transpose would actually make the computation less efficient with the current implementation.
The last post in that issue sums it up quite nicely:
It's historical weight layout, changing it is backward-incompatible. Unless there is some BIG benefit in terms of speed or convenience, we wont break userland.
https://github.com/pytorch/pytorch/issues/2159#issuecomment-390068272
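To see why a transpose can be (almost) free, here is a toy sketch in plain Python. It is not PyTorch's actual implementation; the StridedView class and the linear helper are hypothetical names made up for illustration. The idea it shows is that a transposed matrix can be a view that swaps shape and stride metadata while sharing the same underlying buffer, so nothing is copied.

```python
class StridedView:
    """Toy 2-D view over a flat list, described by shape and strides (hypothetical)."""

    def __init__(self, data, shape, strides):
        self.data, self.shape, self.strides = data, shape, strides

    def t(self):
        # Transposing only swaps the metadata; the flat buffer is shared, not copied.
        return StridedView(self.data, self.shape[::-1], self.strides[::-1])

    def __getitem__(self, idx):
        i, j = idx
        return self.data[i * self.strides[0] + j * self.strides[1]]


def linear(x, weight):
    """Compute x @ weight.t() for a single input vector x (no bias)."""
    wt = weight.t()  # shape (in_features, out_features)
    n_in, n_out = wt.shape
    return [sum(x[k] * wt[k, j] for k in range(n_in)) for j in range(n_out)]


# Weight stored as (out_features=2, in_features=3), row-major, like nn.Linear.
weight = StridedView([1, 2, 3, 4, 5, 6], shape=(2, 3), strides=(3, 1))
y = linear([1, 0, 2], weight)
print(y)
```

The weight keeps its (out_features, in_features) layout, and the transposed view is only used to index it differently.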

Related

Is there a canonical/performant way to reduce arrays/matrices by removing the border values?

A motivating issue, implemented in Matlab:
N = 1000;
R = zeros(2*N);
for i=0:N-1
R = R(2:end-1, 2:end-1);
end
For this code timeit() reports 2.9793 seconds on my machine, which isn't great.
By canonical I mean not merely an acceptable approach, but a performant implementation that copes with reducing very large matrices. I would appreciate any answer, or pointers to other discussions or literature.
As for language, I am not really a programmer, this question is motivated by a mathematics inquiry and I have encountered performance issues implementing any such reduction process in Matlab. Is there a solution to this in Matlab, or must one delve into the scary depths of C/C++?
One note: One may ask, why not just keep the matrix as is and consider parts of it as needed? To clarify, the reduction process in practice of course depends on the actual (nonzero) values of the elements, e.g. by processing the matrix in 2x2 blocks, and the removal of edge values is needed to prepare the matrix for the next reduction step.
R(2:end-1, 2:end-1) is the correct way of extracting the part of the array that is all values except the ones at the edges. This requires copying the data, so it will take some time. There is no legal way around the copy, and no alternative for extracting a part of the array. (subsref might seem like an alternative, but it is the function that is called internally for the given syntax.)
As for illegal ways, you could try James Tursa’s sharedchild from the MATLAB FileExchange. It lets you create an array that references subsets of the data of another array. James is well known in the MATLAB user community as one of the people reverse-engineering the system and bending it to his will. This is solid code. But every version of MATLAB introduces new changes to the infrastructure, so upgrading MATLAB might break your program if you use this code.
You don't need the for loop. If you want to remove L elements from the borders, simply do:
R=R(L+1:end-L, L+1:end-L)
I am surprised you didn't get an error with that code. I think you should end up with an empty matrix at the end of the loop.
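The equivalence between peeling one border layer per iteration and removing L layers in a single slice can be sketched in Python on nested lists. This is only an illustration of the indexing, not a performance claim about MATLAB; strip_border is a hypothetical helper name.

```python
def strip_border(matrix, layers=1):
    """Return a copy of matrix with `layers` rows/columns removed from every edge."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    return [row[layers:n_cols - layers] for row in matrix[layers:n_rows - layers]]


# 6x6 test matrix with distinct entries.
R = [[r * 6 + c for c in range(6)] for r in range(6)]

# Peeling one layer at a time, twice ...
looped = R
for _ in range(2):
    looped = strip_border(looped, 1)

# ... gives the same result as removing two layers in one slice.
direct = strip_border(R, 2)
print(direct)
```

Removing as many layers as half the side length leaves an empty result, which matches the remark that the loop in the question ends with an empty matrix.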

robust online algorithm for semi-variance

I'm looking for the equivalent of Welford's algorithm for the online computation of semi-variance (downside partial variance). Does anyone know of a good reference? Does such an algorithm even exist?
Edit: The case where the semi-variance is taken relative to a fixed target is trivial. The problem is calculating the semi-variance relative to the mean.
I believe the answer is one does not exist and I'm going to try to outline a proof of why this is so.
Consider a 'useful' online algorithm to be defined by two criteria:
1. It must have fixed memory requirements during processing.
2. Each update should take a fixed amount of time.
This is stricter than the literal definition of a sequential/incremental/online algorithm, which really just requires that data can be passed in one piece at a time. However, consider that if either 1) or 2) were not true, then after processing a large enough number of elements, the memory or time required to run the algorithm would eventually become infeasible. Usually, one of the reasons why online algorithms are used is that they can run continuously without fear of the performance slowly getting worse. Also, note that there are online algorithms for calculating the mean and variance that satisfy both 1) and 2), and I think that's what we are aiming to achieve.
Now to the problem posed. During processing, the mean will change with every bit of new data. That in turn means the set of observations that fall below the mean will change. When this happens, we need to adjust our running semi-variance according to the set "delta", defined as the elements that belong to exactly one of two sets: the set of elements below the old mean and the set of elements below the new mean (that is, their symmetric difference). We will have to calculate this delta in the process of adjusting the old semi-variance to the new semi-variance in the presence of new data.
Now let's consider the complexity of calculating this set delta. We will need to find all elements that fall between the old mean and the new mean. We will always keep track of the old mean, while the new mean can be calculated incrementally in fixed time so they pose no problem. However to calculate the delta itself, there is no way to do it other than requiring us to keep track of all the previous elements in our set. This immediately breaks the memory condition of an online algorithm. Secondly, even if we keep the previous elements in our set sorted, the best speed we can achieve to find those that are between the old mean and new mean is O(log(number of elements)), which is worse than fixed. So eventually, with enough elements, the online algorithm will not only require more memory than we have, but it will also require more time.
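The contrast drawn above can be sketched in Python. Welford's algorithm and the fixed-target case both need only a few running numbers, while the mean-relative case below keeps the whole history. The class and function names are hypothetical, and dividing by the total number of observations is an assumption (some definitions divide by the count of observations below the threshold instead).

```python
class Welford:
    """Online mean and population variance: O(1) memory and time per update."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n if self.n else 0.0


class TargetSemivariance:
    """Online semi-variance below a FIXED target: one running sum suffices."""

    def __init__(self, target):
        self.target, self.n, self.below_sq = target, 0, 0.0

    def update(self, x):
        self.n += 1
        if x < self.target:
            self.below_sq += (x - self.target) ** 2

    @property
    def value(self):
        return self.below_sq / self.n if self.n else 0.0


def semivariance_about_mean(values):
    """Batch version: needs all values because the threshold (the mean) moves."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values if x < mean) / len(values)


data = [1.0, 2.0, 3.0, 4.0, 10.0]
w, ts = Welford(), TargetSemivariance(target=4.0)
for x in data:
    w.update(x)
    ts.update(x)
print(w.mean, w.variance, ts.value, semivariance_about_mean(data))
```

The two semi-variance figures agree here only because the fixed target was chosen to equal the final mean, which is not known while the data is still arriving.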
http://www3.sympatico.ca/jean-v.cote/computation_of_semi-variance.pdf
P.S.:This is not an incremental computation. I have another idea. I will keep you posted.

Multiple parameter optimization with lots of local minima

I'm looking for algorithms to find a "best" set of parameter values. The function in question has a lot of local minima and changes very quickly. To make matters even worse, testing a set of parameters is very slow - on the order of 1 minute - and I can't compute the gradient directly.
Are there any well-known algorithms for this kind of optimization?
I've had moderate success with just trying random values. I'm wondering if I can improve the performance by making the random parameter chooser have a lower chance of picking parameters close to ones that had produced bad results in the past. Is there a name for this approach so that I can search for specific advice?
More info:
Parameters are continuous
There are on the order of 5-10 parameters. Certainly not more than 10.
How many parameters are there, i.e. how many dimensions in the search space? Are they continuous or discrete, e.g. real numbers, integers, or just a few possible values?
Approaches that I've seen used for this kind of problem have a similar overall structure: take a large number of sample points, and somehow adjust them all towards regions that have "good" answers. Since you have a lot of points, their relative differences serve as a makeshift gradient.
Simulated Annealing: The classic approach. Take a bunch of points, and probabilistically move some of them to a neighbouring point chosen at random, depending on how much better it is.
Particle Swarm Optimization: Take a "swarm" of particles with velocities in the search space, and probabilistically move a particle at random; if it's an improvement, let the whole swarm know.
Genetic Algorithms: This is a little different. Rather than using the neighbours information like above, you take the best results each time and "cross-breed" them hoping to get the best characteristics of each.
The wikipedia links have pseudocode for the first two; GA methods have so much variety that it's hard to list just one algorithm, but you can follow links from there. Note that there are implementations for all of the above out there that you can use or take as a starting point.
Note that all of these -- and really any approach to this kind of high-dimensional search problem -- are heuristics, which means they have parameters that have to be tuned to your particular problem, which can be tedious.
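As a rough illustration of the first method in the list, here is a minimal single-walker simulated annealing sketch in Python. The step size, starting temperature, cooling factor and the Rastrigin test function are all assumptions chosen for the example, and they are exactly the kind of parameters that would need tuning for a real problem.

```python
import math
import random


def simulated_annealing(f, x0, step=0.5, t0=1.0, cooling=0.95, iters=200, rng=None):
    """Minimise f starting from x0; returns the best point seen and its value."""
    rng = rng or random.Random()
    x, fx = list(x0), f(x0)
    best_x, best_f = list(x), fx
    t = t0
    for _ in range(iters):
        cand = [xi + rng.gauss(0.0, step) for xi in x]
        fc = f(cand)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if fc < fx or rng.random() < math.exp((fx - fc) / t):
            x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = list(x), fx
        t *= cooling  # cool down so that bad moves become less likely over time
    return best_x, best_f


def rastrigin(x):
    """Standard multimodal test function with many local minima (minimum 0 at the origin)."""
    return 10 * len(x) + sum(xi * xi - 10 * math.cos(2 * math.pi * xi) for xi in x)


start = [3.0, -2.5]
best_x, best_f = simulated_annealing(rastrigin, start, rng=random.Random(0))
print(best_x, best_f)
```

With 200 iterations this costs 201 function evaluations, which at one minute each is over three hours, so the iteration budget matters as much as the algorithm.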
By the way, the fact that the function evaluation is so expensive can be made to work for you a bit; since all the above methods involve lots of independent function evaluations, that piece of the algorithm can be trivially parallelized with OpenMP or something similar to make use of as many cores as you have on your machine.
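In Python, a comparable way to run independent evaluations concurrently is the standard concurrent.futures module. The sketch below uses a thread pool, which assumes the slow evaluation spends its time outside the interpreter (for instance in an external simulation); for CPU-bound pure-Python objectives a ProcessPoolExecutor would be the more suitable choice. The slow_objective function is a hypothetical stand-in.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_batch(f, candidates, executor):
    """Evaluate f on every candidate concurrently; results keep the input order."""
    return list(executor.map(f, candidates))


def slow_objective(params):
    # Hypothetical stand-in for the one-minute evaluation.
    return sum(p * p for p in params)


candidates = [[1.0, 2.0], [0.5, 0.5], [3.0, -1.0]]
with ThreadPoolExecutor(max_workers=3) as pool:
    scores = evaluate_batch(slow_objective, candidates, pool)

# Pick the candidate with the lowest score.
best_score, best_params = min(zip(scores, candidates))
print(best_score, best_params)
```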
Your situation seems to be similar to that of the poster of Software to Tune/Calibrate Properties for Heuristic Algorithms, and I would give you the same advice I gave there: consider a Metropolis-Hastings like approach with multiple walkers and a simulated annealing of the step sizes.
The difficulty in using Monte Carlo methods in your case is the expensive evaluation of each candidate. How expensive is it, compared to the time you have at hand? If you need a good answer in a few minutes, this isn't going to be fast enough. If you can leave it running overnight, it'll work reasonably well.
Given a complicated search space, I'd recommend a random initial distribution. Your final answer may simply be the best individual result recorded during the whole run, or the mean position of the walker with the best result.
Don't be put off that I was discussing maximizing there and you want to minimize: the figure of merit can be negated or inverted.
I've tried Simulated Annealing and Particle Swarm Optimization. (As a reminder, I couldn't use gradient descent because the gradient cannot be computed).
I've also tried an algorithm that does the following:
Pick a random point and a random direction
Evaluate the function
Keep moving along the random direction for as long as the result keeps improving, speeding up on every successful iteration.
When the result stops improving, step back and instead attempt to move into an orthogonal direction by the same distance.
This "orthogonal direction" was generated by creating a random orthogonal matrix (adapted this code) with the necessary number of dimensions.
If moving in the orthogonal direction improved the result, the algorithm just continued with that direction. If none of the directions improved the result, the jump distance was halved and a new set of orthogonal directions would be attempted. Eventually the algorithm concluded it must be in a local minimum, remembered it and restarted the whole lot at a new random point.
This approach performed considerably better than Simulated Annealing and Particle Swarm: it required fewer evaluations of the (very slow) function to achieve a result of the same quality.
Of course my implementations of S.A. and P.S.O. could well be flawed - these are tricky algorithms with a lot of room for tweaking parameters. But I just thought I'd mention what ended up working best for me.
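A simplified sketch of this kind of search, in Python, might look as follows. It is a loose reading of the description above and not the poster's code: it tries every direction of a random orthonormal basis (built with Gram-Schmidt) in both signs, doubles the stride after each success, halves the step when nothing improves, and leaves the restart from a new random point to the caller. All names and defaults, including the evaluation budget max_evals, are assumptions.

```python
import random


def random_orthonormal_basis(dim, rng):
    """Gram-Schmidt on random Gaussian vectors; returns dim orthonormal directions."""
    basis = []
    while len(basis) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:
            dot = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - dot * bi for vi, bi in zip(v, b)]
        norm = sum(vi * vi for vi in v) ** 0.5
        if norm > 1e-12:  # skip the (very unlikely) degenerate draw
            basis.append([vi / norm for vi in v])
    return basis


def direction_search(f, x0, step=1.0, min_step=1e-3, max_evals=2000, rng=None):
    """Local search from x0; stops when the step is tiny or the budget is spent."""
    rng = rng or random.Random()
    x, fx, evals = list(x0), f(x0), 1
    while step > min_step and evals < max_evals:
        improved = False
        for d in random_orthonormal_basis(len(x), rng):
            for sign in (1.0, -1.0):
                stride = step
                while evals < max_evals:
                    cand = [xi + sign * stride * di for xi, di in zip(x, d)]
                    fc = f(cand)
                    evals += 1
                    if fc >= fx:
                        break  # this direction stopped helping
                    x, fx, improved = cand, fc, True
                    stride *= 2.0  # speed up after each successful move
        if not improved:
            step /= 2.0  # no direction helped: shrink the jump distance
    return x, fx


def sphere(x):
    return sum(xi * xi for xi in x)


start = [3.0, -2.0, 1.0]
found_x, found_f = direction_search(sphere, start, rng=random.Random(1))
print(found_x, found_f)
```

The evaluation budget is there because each call to the real function is slow; with a one-minute evaluation it would have to be far smaller than the default used here.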
I can't really help you with finding an algorithm for your specific problem.
However in regards to the random choosing of parameters I think what you are looking for are genetic algorithms. Genetic algorithms are generally based on choosing some random input, selecting those, which are the best fit (so far) for the problem, and randomly mutating/combining them to generate a next generation for which again the best are selected.
If the function is more or less continuous (that is, small mutations of good inputs generally won't generate bad inputs, "small" being a somewhat loose term), this would work reasonably well for your problem.
There is no generalized way to answer your question. There are lots of books/papers on the subject matter, but you'll have to choose your path according to your needs, which are not clearly spoken here.
Some things to know, however - 1min/test is way too much for any algorithm to handle. I guess that in your case, you must really do one of the following:
get 100 computers to cut your parameter testing time to some reasonable time
really try to work out your parameters by hand and mind. There must be some redundancy and at least some sanity check so you can test your case in <1min
for possible result sets, try to figure out some 'operations' that modify them slightly instead of just randomizing them. For example, in TSP a basic operator is lambda, which swaps two nodes and thus creates a new route. Yours could be shifting some number up or down by some value.
then, find yourself some nice algorithm; your starting point can be somewhere here. The book is an invaluable resource for anyone who is starting out with problem-solving.

Smoothing values over time: moving average or something better?

I'm coding something at the moment where I'm taking a bunch of values over time from a hardware compass. This compass is very accurate and updates very often, with the result that if it jiggles slightly, I end up with the odd value that's wildly inconsistent with its neighbours. I want to smooth those values out.
Having done some reading around, it would appear that what I want is a high-pass filter, a low-pass filter or a moving average. Moving average I can get down with, just keep a history of the last 5 values or whatever, and use the average of those values downstream in my code where I was once just using the most recent value.
That should, I think, smooth out those jiggles nicely, but it strikes me that it's probably quite inefficient, and this is probably one of those Known Problems to Proper Programmers to which there's a really neat Clever Math solution.
I am, however, one of those awful self-taught programmers without a shred of formal education in anything even vaguely related to CompSci or Math. Reading around a bit suggests that this may be a high or low pass filter, but I can't find anything that explains in terms comprehensible to a hack like me what the effect of these algorithms would be on an array of values, let alone how the math works. The answer given here, for instance, technically does answer my question, but only in terms comprehensible to those who would probably already know how to solve the problem.
It would be a very lovely and clever person indeed who could explain the sort of problem this is, and how the solutions work, in terms understandable to an Arts graduate.
If you are trying to remove the occasional odd value, a low-pass filter is the best of the three options that you have identified. Low-pass filters allow low-speed changes such as the ones caused by rotating a compass by hand, while rejecting high-speed changes such as the ones caused by bumps on the road, for example.
A moving average will probably not be sufficient, since the effects of a single "blip" in your data will affect several subsequent values, depending on the size of your moving average window.
If the odd values are easily detected, you may even be better off with a glitch-removal algorithm that completely ignores them:
if (abs(thisValue - averageOfLast10Values) > someThreshold)
{
thisValue = averageOfLast10Values;
}
Here is a quick graph to illustrate:
The first graph is the input signal, with one unpleasant glitch. The second graph shows the effect of a 10-sample moving average. The final graph is a combination of the 10-sample average and the simple glitch detection algorithm shown above. When the glitch is detected, the 10-sample average is used instead of the actual value.
If your moving average has to be long in order to achieve the required smoothing, and you don't really need any particular shape of kernel, then you're better off if you use an exponentially decaying moving average:
a(i+1) = tiny*data(i+1) + (1.0-tiny)*a(i)
where you choose tiny to be an appropriate small constant (e.g. if you choose tiny = 1/N, it will have roughly the same amount of averaging as a window of size N, but distributed differently over older points).
Anyway, since the next value of the moving average depends only on the previous one and your data, you don't have to keep a queue or anything. And you can think of this as doing something like, "Well, I've got a new point, but I don't really trust it, so I'm going to keep 80% of my old estimate of the measurement, and only trust this new data point 20%". That's pretty much the same as saying, "Well, I only trust this new point 20%, and I'll use 4 other points that I trust the same amount", except that instead of explicitly taking the 4 other points, you're assuming that the averaging you did last time was sensible so you can use your previous work.
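A minimal Python sketch of this recurrence, using the 20% trust in new data from the example above. The function name and the parameter name alpha (the constant the answer calls tiny) are made up for illustration, and seeding the average with the first sample is an assumption.

```python
def exponential_moving_average(samples, alpha):
    """Return the smoothed series a(i+1) = alpha*data(i+1) + (1 - alpha)*a(i)."""
    smoothed = []
    avg = None
    for x in samples:
        # Seed the average with the first sample, then blend in each new one.
        avg = float(x) if avg is None else alpha * x + (1.0 - alpha) * avg
        smoothed.append(avg)
    return smoothed


# A steady reading of 10 followed by a jump to 20: the average moves only 20% of the way.
result = exponential_moving_average([10, 10, 20], alpha=0.2)
print(result)
```

Only the previous average has to be remembered between samples, so no queue of past values is needed.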
Moving average I can get down with ... but it strikes me that it's probably quite inefficient.
There's really no reason a moving average should be inefficient. You keep the number of data points you want in some buffer (like a circular queue). On each new data point, you pop the oldest value and subtract it from a sum, and push the newest and add it to the sum. So every new data point really only entails a pop/push, an addition and a subtraction. Your moving average is always this shifting sum divided by the number of values in your buffer.
It gets a little trickier if you're receiving data concurrently from multiple threads, but since your data is coming from a hardware device that seems highly doubtful to me.
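The running-sum idea described above can be sketched in Python with collections.deque as the buffer. The class name is hypothetical, and averaging over however many samples have arrived so far while the buffer is still filling is an assumption.

```python
from collections import deque


class MovingAverage:
    """Moving average over the last `size` samples with O(1) work per sample."""

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0.0

    def add(self, x):
        # Push the newest value and add it to the running sum.
        self.window.append(x)
        self.total += x
        # Once the buffer is over capacity, pop the oldest value and subtract it.
        if len(self.window) > self.size:
            self.total -= self.window.popleft()
        return self.total / len(self.window)


ma = MovingAverage(3)
averages = [ma.add(v) for v in [3, 6, 9, 12]]
print(averages)
```

For very long-running streams of floating-point data the running sum can accumulate rounding error, so recomputing it from the buffer now and then is a cheap safeguard.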
Oh and also: awful self-taught programmers unite! ;)
An exponentially decaying moving average can be calculated "by hand" with only the trend if you use the proper values. See http://www.fourmilab.ch/hackdiet/e4/ for an idea on how to do this quickly with a pen and paper if you are looking for “exponentially smoothed moving average with 10% smoothing”. But since you have a computer, you probably want to be doing binary shifting as opposed to decimal shifting ;)
This way, all you need is a variable for your current value and one for the average. The next average can then be calculated from that.
There's a technique called a range gate that works well with low-occurrence spurious samples. Assuming the use of one of the filter techniques mentioned above (moving average, exponential), once you have "sufficient" history (one time constant) you can test the new, incoming data sample for reasonableness before it is added to the computation.
Some knowledge of the maximum reasonable rate of change of the signal is required. The raw sample is compared to the most recent smoothed value, and if the absolute value of that difference is greater than the allowed range, that sample is thrown out (or replaced with some heuristic, e.g. a prediction based on slope, differential, or the "trend" prediction value from double exponential smoothing).
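A range gate in front of an exponential filter could be sketched in Python like this. The function name, the warmup count standing in for "sufficient history", and the choice to simply discard rejected samples are all assumptions made for the example.

```python
def range_gated_ema(samples, alpha, max_jump, warmup=5):
    """Exponential smoothing that ignores samples too far from the current estimate."""
    avg = None
    out = []
    for n, x in enumerate(samples):
        if avg is None:
            avg = float(x)  # seed with the first sample
        elif n >= warmup and abs(x - avg) > max_jump:
            pass  # outside the allowed range: throw the sample away
        else:
            avg = alpha * x + (1.0 - alpha) * avg
        out.append(avg)
    return out


# A steady heading of 100 with one spurious reading of 250.
readings = [100] * 6 + [250] + [100]
gated = range_gated_ema(readings, alpha=0.5, max_jump=20)
print(gated)
```

One caveat with this simple version: if the signal genuinely jumps by more than max_jump, every later sample is rejected too, so a real implementation would need some way to re-acquire the signal.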

Genetic Algorithms applied to Curve Fitting

Let's imagine I have an unknown function that I want to approximate via Genetic Algorithms. For this case, I'll assume it is y = 2x.
I'd have a DNA composed of 5 elements, one y for each x, from x = 0 to x = 4, and after a lot of trials and computation I'd arrive at something close to:
best_adn = [ 0, 2, 4, 6, 8 ]
Keep in mind I don't know beforehand if it is a linear function, a polynomial or something way more ugly. Also, my goal is not to infer the type of function from best_adn; I just want those points, so I can use them later.
This was just an example problem. In my case, instead of having only 5 points in the DNA, I have something like 50 or 100. What is the best approach with GA to find the best set of points?
1. Generating a population of 100, discard the worst 20%
2. Recombine the remaining 80%? How? Cutting them at a random point and then putting together the first part of ADN of the father with the second part of ADN of the mother?
3. Mutation, how should I define in this kind of problem mutation?
4. Is it worth using Elitism?
5. Any other simple idea worth using around?
Thanks
Usually you only find these out by experimentation... perhaps writing a GA to tune your GA.
But that aside, I don't understand what you're asking. If you don't know what the function is, and you also don't know the points to begin with, how do you determine fitness?
From my current understanding of the problem, this is better fitted by a neural network.
edit:
2.Recombine the remaining 80%? How? Cutting them at a random point and then putting together the first part of ADN of the father with the second part of ADN of the mother?
This is called crossover. If you want to be saucy, pick a random starting point and swap a random length. For instance, you have 10 elements in an object. Randomly choose a spot X between 1 and 10 and swap x..10-rand%10+1.. you get the picture... spice it up a little.
3.Mutation, how should I define in this kind of problem mutation?
Usually that depends more on what is defined as a legal solution than anything else. You can do mutation the same way you do crossover, except you fill it with random data (that is legal) rather than swapping with another specimen... and you do it at a MUCH lower rate.
4.Is it worth using Elitism?
experiment and find out.
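To make the pieces discussed above concrete, here is a small Python sketch of a GA with single-point crossover, low-rate mutation and elitism. It assumes a fitness function is available, which is the very point questioned above; for illustration it scores genomes against the y = 2x example from the question. Every name and parameter value is hypothetical and would need experimentation.

```python
import random


def evolve(fitness, genome_len, pop_size=100, generations=50,
           keep=0.8, mutation_rate=0.05, rng=None):
    """Maximise fitness; returns the best genome and the best score per generation."""
    rng = rng or random.Random()
    pop = [[rng.uniform(-10, 10) for _ in range(genome_len)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        survivors = pop[:int(keep * pop_size)]  # discard the worst 20%
        children = [survivors[0]]               # elitism: keep the best unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, genome_len)  # single-point crossover
            child = father[:cut] + mother[cut:]
            # Mutation: occasionally nudge a gene with Gaussian noise.
            child = [g + rng.gauss(0, 1) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness), history


def fitness(genome):
    # Assumed target for the example: y = 2x at x = 0..4 (higher is better).
    return -sum((g - 2 * x) ** 2 for x, g in enumerate(genome))


best, history = evolve(fitness, genome_len=5, rng=random.Random(0))
print(best, fitness(best))
```

Because the best genome is carried over unchanged, the best score can never get worse from one generation to the next, which is the main argument for elitism.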
Gaussian adaptation usually outperforms standard genetic algorithms. If you don't want to write your own package from scratch, the Mathematica Global Optimization package is EXCELLENT -- I used it to fit a really nasty nonlinear function where standard fitters failed miserably.
Edit:
Wikipedia Article
If you hunt down prints of the papers listed in the article, you can find whitepapers and implementations. In general though, you should have some idea of what the solution space of the fitness function you are maximizing looks like. If the number of variables is small, or the number of local maxima is small or they are connected/slope down to a global maximum, simple least squares works fine. If the area around each local maximum is small (i.e. you have to get a damned good solution to hit the best one, otherwise you hit a bad one), then fancier algorithms are needed.
Choosing variables for a genetic algorithm depends on what the solution space will look like.

Resources