I am trying to implement Perceptron Algorithm but I am not able to figure out following points.
what should be ideal value for iteration number
is this algorithm suitable for large volumes of data?
does threshold changes with iterations?
if yes what difference does it make in final output?
The Perceptron is not a specific algorithm, it's a name of a cluster of algorithms. There're 2 major differences between these algorithms.
1. Integrate and fire rule
Let the input vector be x, the weights vector be w, the threshold be t and the output value be P(x). There're various function to calculate P(x):
binary: P(x) = 1 (if w * x>=t) or 0 (otherwise)
semi-linear: P(x) = w * x (if w * x>=t) or 0 (otherwise)
hard limit: P(x) = t(if w * x>=t) or w * x (if 0<w * x<t) or 0 (otherwise)
Sigmoid: P(x) = 1 / 1+e^(w * x)
and many others. So it's hard to say what difference does the threshold make in final ouptut, because it depends on which integrate and fire function you use.
2. Learning rule
The learning rule of the Perceptron is various too. The most simple and common one is
w -> w+ a * x* (D(x)- P(x))
where a is the size of a learning step, and D(x) is the expected output to x. So it's also hard to say that what should be a ideal value of iterations, because it depends on the value of a and how many training samples you do have.
Therefore, does thresold changes with iterations? it also depends. The simple and common learning rule above doesn't modify the threshold while training, but there're some other learning rules do modify it.
Btw, you also asked that is this algorothm suitable for large volume of data? The main metrics to measure the suitability of a classifier for certain dataset, is the linear separability of the dataset, not the scale of it. Keep in mind that the Single-layer Perceptron has very bad performance for the datasets which are not linear separable. The scale of the dataset does NOT that matter.
Related
In an upcoming simulation project, I will come in a situation where I will have to draw one random element from a huge vector in a weighted sense. For most elements of the vector, the assigned weight will be zero. I also need to draw only one element, so the replacement or no replacement function is irrelevant.
This random picking step will be the bottleneck for my simulation, so getting the best efficiency and speed will be critical.
Are there any hacks/tips on what is best to do? Are there any important savings possible in the context of my project?
PS: Is randsample reliable on huge vectors?
Knowing that most weights are equal to zero you can rewrite a faster implementation of randsample from Octave source. In my timing it is 6X-7X faster than the original implementation:
function y = randsample_fast(v, w)
f = find(w);
w = w(f);
w = w / sum(w);
w = [0 cumsum(w)];
y = f(lookup (w , rand));
%y = f(find (w <= rand, 1, "last"));
y = v(y);
end
Inputs are assumed to be row vectors.
Changing find to lookup may slightly improve the performance.
Have a look at the source code of randsample.m in the statistics package. It's actually quite a simple implementation. It creates a normalised cumulative weights vector from the weights vector, and then effectively samples it via standard inverse sampling.
I don't know what you mean by 'huge', but as long as the weights vector can fit in memory, there is no reason why this shouldn't be fast.
If by 'huge' you mean something that does not fit in memory, then you could create a 'huge version' of this function that splits the cumulative weights vector into predictable 'bins' saved on disk, and only performs inverse sampling from the right bin.
The only thing I'd add to this is, given the implementation and that you're only interested in a single draw, then you would probably benefit from speed if you specified 'replacement' as 'true' explicitly, since the default is 'false' (i.e. without replacement), and sampling with replacement seems to avoid a lot of unnecessary and expensive steps (permutations etc).
I'm a layman of statistics, after reading some blog and examples about Metropolis-Hastings MCMC algorithms, here i have a question:
In the M-H algorithms, a new sample(say, x') whether be accepted depends on the "acceptance probability", alpha = min(1,A), where
A = p(x’)q(x|x’)/[p(x)q(x’|x)].
Question comes from here,as we don't know the form of interested distribution p(x), how do we calculate the acceptance probability, which contains p(x') and p(x) ?
Or in other words: how does metropolis-hesting algorithm ensure samples obey an unknown posterior distribution.
I am a newbie, please give me some advise.
Here is a chunk of text from https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm
The lax requirement that f(x) should be merely proportional to the density, rather than exactly equal to it, makes the Metropolis–Hastings algorithm particularly useful, because calculating the necessary normalization factor is often extremely difficult in practice.
(end quote)
so your p(x) should be the probability of x under the target distribution times some unknown constant factor. One example of such a p(x) is when you impose a constraint on a known distribution. In this case you don't know p(x), but you know p(x) * K for some unknown K, so you can still calculate p(x)/p(x') = (p(x) * K) / (p(x') * K).
Suppose you want to simulate points distributed within a circle, but you didn't know the value of pi, so you couldn't work out the area of the circle. Then you could make p(x) the uniform distribution on the unit square for points <= distance 0.5 from (0.5, 0.5) and 0 elsewhere. This gives you a p(x) that integrates, not to 1, but to Pi/4, which you don't know, so it's not a proper probability distribution, but it is one times a constant factor.
I am trying to perform inverse inference on a simple bayesian network for piece wise linear regression. That is, y is a piece wise linear function of x :Plot of Y vs X
and the Bayesian network looks like this: Bayesian Network Model
Here, X has a normal distribution, K is a discrete node that has a softmax distribution conditioned on X and Y is a mixture of linear gaussians based on the value of K (i.e. Pr(Y | K=i, X=x) ~ N(mu=w_i*x+b_i, s_i)).
I have learned the parameters of this model using EM algorithm. (The actual relationship of Y and X has five linear pieces, but I have learnt using 8 levels for the discrete node). And formed the pymc model using those parameters. Here is the code:
x=pymc.Normal('x', mu=0.5, tau=1.0/0.095)
#The probabilities of discrete node given x=x; Softmax distribution
epower = [-11.818,54.450,29.270,-13.038,73.541,28.466,-57.530,-101.568]
bias = [7.8228,-35.3859,-12.9512,12.8004,-48.1097,-13.2229,30.6079,39.3811]
#pymc.deterministic(plot=False)
def prob(epower=epower,bias=bias,x=x):
pr=[np.exp(ep*x+bb) for ep, bb in zip(epower, bias)]
return [pri/np.sum(pr) for pri in pr]
knode=pymc.Categorical('knode', p=prob)
#The weights of regression
wtsY=[15.022, -70.000, -14.996, 15.026, -70.000, -14.996, 34.937, 15.027]
#The unconditional means of Y
meansY=[5.9881,68.0000,23.9973,5.9861,68.0000,23.9972,-1.9809,1.9982]
sigmasY=[0.010189,0.010000,0.010033,0.010211,0.010000,0.010036,0.010380,0.010167]
#pymc.deterministic(plot=False)
def condmeanY(knode=knode, x=x,wtsY=wtsY, meansY=meansY):
return wtsY[knode]*x + meansY[knode]
#pymc.deterministic(plot=False)
def condsigmaY(knode=knode, sigmasY=sigmasY):
return sigmasY[knode]
y=pymc.Normal('y', mu=condmeanY, tau=1.0/condsigmaY, value=13.5, observed=True)
I want to predict x, when y is observed (inverse inference). As y is (approximately) non-linear in x, there will be multiple solutions for a given value of y. I expect that the obtained trace of x should show those multiple solutions. I have ensured that autocorrelation is very low (sample=2000, burn=1000). But I am not able to see multiple solutions. In the above example, for y=13.5, there are two possible solutions, x=0.5 and x=0.7. But the chain only wanders near 0.5. The histogram has only one peak, at 0.5.
Am I missing something?
EDIT: I came across this very relevant question:Solving inverse problems with PyMC. What I learned from the answer is that the prior of x, which I am assuming to be uni-modal Gaussian here, should have a non-parametric distribution and then the obtained samples after first iteration can be used to update it. Kernel density estimation (with gaussian kernel) has been suggested to obtain non-parametric stochastic from data. I incorporated this in my model but still there is no difference. One thing I noted is that if I do the inference multiple times, approx 50% of the times, I get 0.5, and 50% of the times, I get 0.7 (I am not sure if this was the case earlier as well, because I had not run that model many times to observe this.) But still, should I not see two peaks in the trace after first iteration only?
I also tried with a modified version of this model, where the edge from X to K is reversed. This is a classical conditional linear Gaussian model. Even with this model, I could not get multiple solutions visible in the trace. I am sort of stuck here. Please help.
I have been learning about Machine learning algorithms this semester but I cant seem to understand how the parameters theta are used once Gradient decent is ran and they are updated, specifically in Logistic regression, In short my question is how is the decision boundary piloted after the parameters theta are updated.
After you use gradient descent to estimate your parameters theta, you can use those calculated parameters to make predictions.
For any input x, you can now calculate an predicted outcome y.
Ultimately the goal of machine learning is to make predictions.
So you take a whole bunch of observations x and y. Where x is your input and y is your output. In case of logistic regression, y is one of two values. For example, take a bunch of emails (x) that are labeled spam or no spam (y is 1 for spam and 0 for no spam). Or take a bunch of medical images that are labeled healthy or non healthy. ...
Feed all that data in your machine learning algorithm. Your algorithm (gradient descent for example), will calculate the theta coefficients.
Now you can use these theta coefficient to make predictions for new values of x. For example a new email that the system has never seen, using the theta coefficient, you can predict whether it is spam or not.
As far a plotting the decision boundary. This is probably feasible when you have two dimensions for x. You can have one dimension on each axis. And the resulting dots in your graph would be your y values. You could color them differently or show a different shape whether the result is one way or the other (i.e. your y is 0 or 1).
In practicality, these plots are useful during a lecture to get a general gist of what you're trying to do or accomplish. In reality, every input X would probably be a vector of many values (way more than 2). And thus it becomes impossible to plot a decision boundary.
Typically, logistic regression is parametrized in a following way:
cl(x|theta) = 1 / (1 + exp(-SUM_{i=1}^d theta_i x_i + theta_0 )) ) > 0.5
which is equivalent to
cl(x|theta) = sign(SUM_{i=1}^d theta_i x_i + theta_0 )
so once you get your theta, you use it to make a prediction by computing a simple weighted sum of your data representation and you check the sign of such number.
I've seen some machine learning questions on here so I figured I would post a related question:
Suppose I have a dataset where athletes participate at running competitions of 10 km and 20 km with hilly courses i.e. every competition has its own difficulty.
The finishing times from users are almost inverse normally distributed for every competition.
One can write this problem as a matrix:
Comp1 Comp2 Comp3
User1 20min ?? 10min
User2 25min 20min 12min
User3 30min 25min ??
User4 30min ?? ??
I would like to complete the matrix above which has the size 1000x20 and a sparseness of 8 % (!).
There should be a very easy way to complete this matrix, since I can calculate parameters for every user (ability) and parameters for every competition (mu, lambda of distributions). Moreover the correlation between the competitions are very high.
I can take advantage of the rankings User1 < User2 < User3 and Item3 << Item2 < Item1
Could you maybe give me a hint which methods I could use?
Your astute observation that this is a matrix completion problem gets
you most of the way to the solution. I'll codify your intuition that
the combination of ability of a user and difficulty of the course
yields the time of a race, then present various algorithms.
Model
Let the vector u denote the speed of the users so that u_i is user i's
speed. Let the vector v denote the difficulty of the courses so
that v_j is course j's difficulty. Also when available, let t_ij be user i's time on
course j, and define y_ij = 1/t_ij, user i's speed on course j.
Since you say the times are inverse Gaussian distributed, a sensible
model for the observations is
y_ij = u_i * v_j + e_ij,
where e_ij is a zero-mean Gaussian random variable.
To fit this model, we search for vectors u and v that minimize the
prediction error among the observed speeds:
f(u,v) = sum_ij (u_i * v_j - y_ij)^2
Algorithm 1: missing value Singular Value Decomposition
This is the classical Hebbian
algorithm. It
minimizes the above cost function by gradient descent. The gradient of
f wrt to u and v are
df/du_i = sum_j (u_i * v_j - y_ij) v_j
df/dv_j = sum_i (u_i * v_j - y_ij) u_i
Plug these gradients into a Conjugate Gradient solver or BFGS
optimizer, like MATLAB's fmin_unc or scipy's optimize.fmin_ncg or
optimize.fmin_bfgs. Don't roll your own gradient descent unless you're willing to implement a very good line search algorithm.
Algorithm 2: matrix factorization with a trace norm penalty
Recently, simple convex relaxations to this problem have been
proposed. The resulting algorithms are just as simple to code up and seem to
work very well. Check out, for example Collaborative Filtering in a Non-Uniform World:
Learning with the Weighted Trace Norm. These methods minimize
f(m) = sum_ij (m_ij - y_ij)^2 + ||m||_*,
where ||.||_* is the so-called nuclear norm of the matrix m. Implementations will end up again computing gradients with respect to u and v and relying on a nonlinear optimizer.
There are several ways to do this, perhaps the best architecture to try first is the following:
(As usual, as a preprocessing step normalize your data into a uniform function with 0 mean and 1 std deviation as best you can. You can do this by fitting a function to the distribution of all race results, applying its inverse, and then subtracting the mean and dividing by the std deviation.)
Select a hyperparameter N (you can tune this as usual with a cross validation set).
For each participant and each race create an N-dimensional feature vector, initially random. So if there are R races and P participants then there are R+P feature vectors with a total of N(R+P) parameters.
The prediction for a given participant and a given race is a function of the two corresponding feature vectors (as a first try use the scalar product of these two vectors).
Alternate between incrementally improving the participant feature vectors and the race feature vectors.
To improve a feature vector use gradient descent (or some more complex optimization method) on the known data elements (the participant/race pairs for which you have a result).
That is your loss function is:
total_error = 0
forall i,j
if (Participant i participated in Race j)
actual = ActualRaceResult(i,j)
predicted = ScalarProduct(ParticipantFeatures_i, RaceFeatures_j)
total_error += (actual - predicted)^2
So calculate the partial derivative of this function wrt the feature vectors and adjust them incrementally as per a usual ML algorithm.
(You should also include a regularization term on the loss function, for example square of the lengths of the feature vectors)
Let me know if this architecture is clear to you or you need further elaboration.
I think this is a classical task of missing data recovery. There exist some different methods. One of them which I can suggest is based on Self Organizing Feature Map (Kohonen's Map).
Below it's assumed that every athlet record is a pattern, and every competition data is a feature.
Basically, you should divide your data into 2 sets: first - with fully defined patterns, and second - patterns with partially lost features. I assume this is eligible because sparsity is 8%, that is you have enough data (92%) to train net on undamaged records.
Then you feed first set to the SOM and train it on this data. During this process all features are used. I'll not copy algorithm here, because it can be found in many public sources, and even some implementations are available.
After the net is trained, you can feed patterns from the second set to the net. For each pattern the net should calculate best matching unit (BMU), based only on those features that exist in the current pattern. Then you can take from the BMU its weigths, corresponding to missing features.
As alternative, you could not divide the whole data into 2 sets, but train the net on all patterns including the ones with missing features. But for such patterns learning process should be altered in the similar way, that is BMU should be calculated only on existing features in every pattern.
I think you can have a look at the recent low rank matrix completion methods.
The assumption is that your matrix has a low rank compared to the matrix dimension.
min rank(M)
s.t. ||P(M-M')||_F=0
M is the final result, and M' is the uncompleted matrix you currently have.
This algorithm minimizes the rank of your matrix M. P in the constraint is an operator that takes the known terms of your matrix M', and constraint those terms in M to be the same as in M'.
The optimization of this problem has a relaxed version, which is:
min ||M||_* + \lambda*||P(M-M')||_F
rank(M) is relaxed to its convex hull ||M||_* Then you trade off the two terms by controlling the parameter lambda.