I'm looking at VW's docs for update rule options, and I'm confused about the equation that specifies the learning rate schedule using the parameters initial_t, power_t, and decay_learning_rate.
Based on the equation below this line in the docs
specify the learning rate schedule whose generic form
if initial_t is equal to zero (which is the setting by default), it seems that the learning rate will always be zero, for all timesteps and epochs. Is this right?
Also, what would happen if both initial_t and power_t are set to zero? I tried initializing a VW with those settings and it didn't complain.
if initial_t is equal to zero (which is the setting by default), it seems that the learning rate will always be zero, for all timesteps and epochs. Is this right?
initial_t is set to zero by default. By default the initial learning rate will not use initial_t to calculate its value but will start off at its default value, which is 0.5.
Per the documentation, the flags adaptive, normalized, and invariant are on by default. If any of them is specified, the other flags are turned off. In the case that you turn on the invariant flag (so in the case that we are not using normalized or adaptive) the initial learning rate will be calculated using the initial_t and power_t values, and the default initial_t is set to one instead of zero.
If initial_t is explicitly set to zero combined with the invariant flag being set, then yes, the learning rate will also be zero.
Also, what would happen if both initial_t and power_t are set to zero? I tried initializing a VW with those settings and it didn't complain.
If the initial learning rate is calculated using initial_t and power_t and both are explicitly set to zero, C++ should evaluate powf(0,0) to 1, so the learning rate stays at its default value, which can be specified with --learning_rate.
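The two edge cases above can be checked with a small Python sketch. It assumes, going only by the description in this answer and not by the VW source, that on the non-adaptive, non-normalized path the base rate is multiplied by initial_t raised to power_t. The function name is made up for illustration.

```python
import math


def initial_learning_rate(base_rate, initial_t, power_t):
    # Assumption (from the answer above, not from the VW source):
    # the starting rate is base_rate * initial_t ** power_t.
    return base_rate * math.pow(initial_t, power_t)


# initial_t explicitly zero, power_t at 0.5: the rate collapses to zero.
rate_zero_t = initial_learning_rate(0.5, 0.0, 0.5)

# Both zero: pow(0, 0) evaluates to 1, so the base rate is left unchanged.
rate_both_zero = initial_learning_rate(0.5, 0.0, 0.0)
```

Python's math.pow follows the same C convention as powf here, which is why it works as a stand-in.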
If you are running vowpalwabbit via the command line, you should be able to see what these values are set to:
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
decay_learning_rate = 1
I'm training a Doc2Vec model using the below code, where tagged_data is a list of TaggedDocument instances I set up before:
max_epochs = 40
model = Doc2Vec(alpha=0.025,
                min_alpha=0.001)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.001
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
model.save("d2v.model")
print("Model Saved")
When I later check the model results, they're not good. What might have gone wrong?
Do not call .train() multiple times in your own loop that tries to do alpha arithmetic.
It's unnecessary, and it's error-prone.
Specifically, in the above code, decrementing the original 0.025 alpha by 0.001 forty times results in a final alpha of -0.015 (0.025 - 40*0.001), and the alpha would also have been negative for many of the training epochs. But a negative alpha learning-rate is nonsensical: it essentially asks the model to nudge its predictions a little bit in the wrong direction, rather than a little bit in the right direction, on every bulk training update. (Further, since model.iter is by default 5, the above code actually performs 40 * 5 training passes – 200 – which probably isn't the conscious intent. But that will just confuse readers of the code & slow training, not totally sabotage results, like the alpha mishandling.)
There are other variants of error that are common here, as well. If the alpha were instead decremented by 0.0001, the 40 decrements would only reduce the final alpha to 0.021 – whereas the proper practice for this style of SGD (Stochastic Gradient Descent) with linear learning-rate decay is for the value to end "very close to 0.000". If users start tinkering with max_epochs – it is, after all, a parameter pulled out on top! – but don't also adjust the decrement every time, they are likely to far-undershoot or far-overshoot 0.000.
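The arithmetic behind both failure modes is easy to reproduce without gensim. This is only a sketch of the numbers discussed above; the helper names are made up, and values are rounded to six decimals to keep floating-point noise out of the comparison.

```python
def alpha_per_pass(start, step, passes):
    """Alpha in effect for each loop pass when `step` is subtracted after every pass."""
    return [round(start - step * i, 6) for i in range(passes)]


def final_alpha(start, step, passes):
    """Alpha left over once all decrements have been applied."""
    return round(start - step * passes, 6)


# The question's settings: 0.025 start, 0.001 decrement, 40 passes.
schedule = alpha_per_pass(0.025, 0.001, 40)
negative_passes = sum(1 for a in schedule if a < 0)
overshoot = final_alpha(0.025, 0.001, 40)

# The "too small a decrement" variant: 0.0001 per pass.
undershoot = final_alpha(0.025, 0.0001, 40)

print(negative_passes, overshoot, undershoot)
```

With the question's settings, every pass after the 26th runs with a negative alpha.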
So don't use this pattern.
Unfortunately, many bad online examples have copied this anti-pattern from each other, and make serious errors in their own epochs and alpha handling. Please don't copy their error, and please let their authors know they're misleading people wherever this problem appears.
The above code can be improved with the much-simpler replacement:
max_epochs = 40
model = Doc2Vec() # of course, if non-default parameters needed, use them here
# most users won't need to change alpha/min_alpha at all
# but many will want to use more than default `epochs=5`
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=max_epochs)
model.save("d2v.model")
Here, the .train() method will do exactly the requested number of epochs, smoothly reducing the internal effective alpha from its default starting value to near-zero. (It's rare to need to change the starting alpha, but even if you wanted to, just setting a new non-default value at initial model-creation is enough.)
Also: note that later calls to infer_vector() will reuse the epochs specified at the time of model-creation. If nothing is specified, the default epochs=5 will be used - which is often smaller than is best for training or inference. So if you find a larger number of epochs (such as 10, 20 or more) is better for training, remember to also use at least the same number of epochs for inference. (.infer_vector() takes an optional epochs parameter which can override any value set at model-construction.)
I know a set bit is a 1 (as opposed to a 0), but I can't seem to find a quick, simple explanation of what a bit set high is (in a 32 bit integer).
Bit set or bit set high means 1.
Example: take the 8-bit number 10101010.
To check whether the 5th bit is set (or set high), look at that bit: if it is 1, it is set (set high); otherwise it is not set (set low). The same applies to a 32-bit integer.
See this link for a programmatic example as well: LINK
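A minimal Python sketch of such a check. It assumes bits are numbered from 0 starting at the least significant (rightmost) bit, which the answer above doesn't specify, and the helper name is made up for illustration.

```python
def is_bit_set(value, position):
    """True if the bit at `position` (0 = least significant) is 1, i.e. set / high."""
    return (value >> position) & 1 == 1


number = 0b10101010

bit5 = is_bit_set(number, 5)  # reading from the right, bit 5 of 10101010 is 1
bit4 = is_bit_set(number, 4)  # bit 4 is 0, so it is not set (low)
```

The same function works unchanged for a 32-bit value, since Python integers are not limited to 8 bits.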
High, Set and 1 are synonymous in binary computing.
High is just the term more likely to be used by someone dealing with the hardware implementation, and refers to a higher voltage (low being the complement) representing the bit.
E.g. from https://www.electronics-tutorials.ws/binary/bin_1.html:
Generally, a logic “1” represents a higher voltage, such as 5 volts,
which is commonly referred to as a HIGH value, while a logic “0”
represents a low voltage, such as 0 volts or ground, and is commonly
referred to as a LOW value. These two discrete voltage levels
representing the digital values of “1’s” (one’s) and “0’s” (zero’s)
are commonly called: BInary digiTS, and in digital and computational
circuits and applications they are normally referred to as binary
BITS.
I have seen threads with similar questions/problems but I have not found this very issue.
Suppose I train a NN with following cost function:
J(theta) = 1/m * sum(sum( -y * log(h(x)) - ( 1 - y ) * log(1-h(x)) ))
and also use sigmoid function as the activation function.
Now, e.g. for cancer detection, for a CV test I get 0.6 Precision and 0.6 Recall. If I want to get another Ratio of Precision and Recall (e.g. lower Precision but higher Recall) I can just change the threshold of a prediction function (i.e. h(output_layer) > threshold). I guess I could also:
- change the NN architecture,
- change the training set,
- change regularization parameter and I would get a different result.
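To make the threshold trade-off concrete, here is a small sketch. The scores and labels are made up purely for illustration (they are not output from any real network), and `precision_recall` is a hypothetical helper.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting positive for score > threshold."""
    predicted = [s > threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up sigmoid outputs and true labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

at_half = precision_recall(scores, labels, 0.5)
at_quarter = precision_recall(scores, labels, 0.25)
```

On this toy data, lowering the threshold from 0.5 to 0.25 trades precision for recall without touching the network itself.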
But what if I do NOT want to change any architecture of the NN? Is changing the threshold of the predict function really smart? I see it like this: we train our NN with the sigmoid function (that kind of checks if an activation of a certain node is below or above 0.5, roughly speaking). And then, after we have trained the network with this lower-or-higher-than-0.5 approach, we change the last prediction threshold to some other value.
I do not think that this would be the optimal Precision/Recall Ratio (or F1 Score) that is possible with a certain training set and NN architecture. Or in other words, I do not think we 'walk along' the optimal ROC Curve. Is that correct?
My 2 thoughts on how to come up with a better solution:
1.) Change the activation function, either to a completely different function or by shifting the sigmoid function (e.g. sigmoid_new = 0.1 + sigmoid_original). That way I would also get more activation and, I guess, more Recall in the end.
2.) Change the Cost function (!). E.g. to
J(theta) = 1/m * sum(sum( ALPHA * -y * log(h(x)) - ( 1 - y ) * log(1-h(x)) )). With this ALPHA (a scalar) I could punish the -y * log(h(x)) error more (ALPHA > 1) or less (ALPHA < 1). But would I also need to change the backpropagation and/or gradient calculation if I change the cost function?
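On the gradient question, a sketch for a single sigmoid output may help. If z is the pre-activation of the output unit and h = sigmoid(z), differentiating the weighted loss above gives dJ/dz = -ALPHA * y * (1 - h) + (1 - y) * h, which reduces to the usual h - y when ALPHA = 1. So only the output-layer error term changes; it is then propagated backwards as before. The code below checks that derivative numerically; the function names are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weighted_loss(z, y, alpha):
    """Per-example loss: -alpha * y * log(h) - (1 - y) * log(1 - h), with h = sigmoid(z)."""
    h = sigmoid(z)
    return -alpha * y * math.log(h) - (1 - y) * math.log(1 - h)


def weighted_loss_grad(z, y, alpha):
    """Analytic derivative of the weighted loss with respect to z."""
    h = sigmoid(z)
    return -alpha * y * (1 - h) + (1 - y) * h


def numeric_grad(z, y, alpha, eps=1e-6):
    """Central finite difference, used only to sanity-check the analytic form."""
    return (weighted_loss(z + eps, y, alpha) - weighted_loss(z - eps, y, alpha)) / (2 * eps)


gap_positive = abs(weighted_loss_grad(0.3, 1, 2.0) - numeric_grad(0.3, 1, 2.0))
gap_negative = abs(weighted_loss_grad(0.3, 0, 2.0) - numeric_grad(0.3, 0, 2.0))
```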
I'd appreciate every help, link or thought on this topic :-)
Best, Wolfgang
I've recently made an attempt to implement a basic Q-Learning algorithm in Golang. Note that I'm new to Reinforcement Learning and AI in general, so the error may very well be mine.
Here's how I implemented the solution to an m,n,k-game environment:
At each given time t, the agent holds the last state-action (s, a) and the acquired reward for it; the agent selects a move a' based on an Epsilon-greedy policy and calculates the reward r, then proceeds to update the value of Q(s, a) for time t-1
func (agent *RLAgent) learn(reward float64) {
	var mState = marshallState(agent.prevState, agent.id)
	var oldVal = agent.values[mState]
	agent.values[mState] = oldVal + (agent.LearningRate *
		(agent.prevScore + (agent.DiscountFactor * reward) - oldVal))
}
Note:
agent.prevState holds the previous state right after taking the action and before the environment responds (i.e. after the agent makes its move and before the other player makes a move). I use that in place of the state-action tuple, but I'm not quite sure if that's the right approach.
agent.prevScore holds the reward to previous state-action
The reward argument represents the reward for current step's state-action (Qmax)
With agent.LearningRate = 0.2 and agent.DiscountFactor = 0.8 the agent fails to reach 100K episodes because of state-action value overflow.
I'm using golang's float64 (standard IEEE 754-1985 double precision floating point variable), which overflows at around ±1.80×10^308 and yields ±Infinity. That's too big a value I'd say!
Here's the state of a model trained with a learning rate of 0.02 and a discount factor of 0.08 which got through 2M episodes (1M games with itself):
Reinforcement learning model report
Iterations: 2000000
Learned states: 4973
Maximum value: 88781786878142287058992045692178302709335321375413536179603017129368394119653322992958428880260210391115335655910912645569618040471973513955473468092393367618971462560382976.000000
Minimum value: 0.000000
The reward function returns:
Agent won: 1
Agent lost: -1
Draw: 0
Game continues: 0.5
But you can see that the minimum value is zero, and the maximum value is too high.
It may be worth mentioning that a simpler learning method I found in a python script works perfectly fine and actually feels more intelligent! When I play against it, most of the time the result is a draw (it even wins if I play carelessly), whereas with the standard Q-Learning method, I can't even let it win!
agent.values[mState] = oldVal + (agent.LearningRate * (reward - agent.prevScore))
Any ideas on how to fix this?
Is that kind of state-action value normal in Q-Learning?!
Update:
After reading Pablo's answer and the slight but important edit that Nick provided to this question, I realized the problem was prevScore containing the Q-value of previous step (equal to oldVal) instead of the reward of the previous step (in this example, -1, 0, 0.5 or 1).
After that change, the agent now behaves normally and after 2M episodes, the state of the model is as follows:
Reinforcement learning model report
Iterations: 2000000
Learned states: 5477
Maximum value: 1.090465
Minimum value: -0.554718
and out of 5 games with the agent, there were 2 wins for me (the agent did not recognize that I had two stones in a row) and 3 draws.
The reward function is likely the problem. Reinforcement learning methods try to maximize the expected total reward; it gets a positive reward for every time step in the game, so the optimal policy is to play as long as possible! The q-values, which define the value function (expected total reward of taking an action in a state then behaving optimally) are growing because the correct expectation is unbounded. To incentivize winning, you should have a negative reward every time step (kind of like telling the agent to hurry up and win).
See 3.2 Goals and Rewards in Reinforcement Learning: An Introduction for more insight into the purpose and definition of reward signals. The problem you are facing is actually exercise 3.5 in the book.
If I've understood correctly, your Q-learning update rule uses both the current reward and the previous reward. However, the Q-learning rule uses only one reward (x are states and u are actions):
Q(x_t, u_t) <- Q(x_t, u_t) + alpha * (r_{t+1} + gamma * max_u Q(x_{t+1}, u) - Q(x_t, u_t))
On the other hand, you are assuming that the current reward is the same as the Qmax value, which is not true. So you have probably misunderstood the Q-learning algorithm.
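For comparison, here is a minimal sketch of the standard tabular Q-learning update in Python. It is not a port of the Go agent in the question; the function and variable names are made up, and the table is a plain dictionary keyed by (state, action). The point is that the reward and the max over next-state Q-values are separate quantities, and only one reward enters each update.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, next_actions,
                      alpha=0.2, gamma=0.8):
    """One tabular Q-learning step; returns the updated Q(state, action)."""
    # Value of the best action available in the next state (0 if terminal).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)
q[("s1", "left")] = 1.0
q[("s1", "right")] = 0.5

# Non-terminal step: reward 0, best next-state value is 1.0.
mid_game = q_learning_update(q, "s0", "right", 0.0, "s1", ["left", "right"])

# Terminal step: no next actions, so only the reward contributes.
winning = q_learning_update(q, "s2", "up", 1.0, None, [])
```

With rewards bounded and gamma below 1, the values in such a table stay bounded, which matches the sane maximum and minimum reported in the question's update.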
I have a system where I use RS232 to control a lamp that takes a float input representing voltage (in the range 2.5 to 7.5). The control then gives an output in the range 0 to 6000, which is the brightness a sensor picks up.
What I want is to be able to balance the system so that I can specify a brightness value, and the system should balance in on a voltage value that achieves this.
Is there some standard algorithm or technique to find what the voltage input should be in order to get a specific output? I was thinking of an algorithm which iteratively tries values, and from each try determines some new value which should be better for achieving the desired output value (in my case that is 3000).
The voltage values required tend to vary between different systems and also over the lifespan of the lamp, so this should preferably be done completely automatically.
I am just looking for a name for a technique or algorithm, but pseudo code works just as well. :)
Calibrate the system on initial run by trying all voltages between 2.5 and 7.5 in e.g. 0.1V increments, and record the sensor output.
Given e.g. 3000 as a desired brightness level, pick the voltage that gives the closest brightness then adjust up/down in small increments based on the sensor output until the desired brightness is achieved. From time to time (based on your calibrated values becoming less accurate) recalibrate.
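A rough Python sketch of this calibrate-then-adjust idea. The `fake_lamp` function is a hypothetical stand-in for "send a voltage over RS232, read the sensor"; its quadratic shape is an assumption chosen only so the example runs, and all helper names are made up.

```python
def fake_lamp(voltage):
    """Hypothetical lamp + sensor: 2.5 V gives 0, 7.5 V gives 6000."""
    return 240.0 * (voltage - 2.5) ** 2


def calibrate(measure, lo=2.5, hi=7.5, step=0.1):
    """Sweep the voltage range and record (voltage, brightness) pairs."""
    steps = int(round((hi - lo) / step))
    return [(lo + i * step, measure(lo + i * step)) for i in range(steps + 1)]


def closest_voltage(table, target):
    """Calibrated voltage whose recorded brightness is nearest the target."""
    return min(table, key=lambda row: abs(row[1] - target))[0]


def fine_tune(measure, voltage, target, increment=0.005, tolerance=5.0,
              lo=2.5, hi=7.5, max_steps=1000):
    """Nudge the voltage up or down until brightness is within tolerance."""
    for _ in range(max_steps):
        brightness = measure(voltage)
        if abs(brightness - target) <= tolerance:
            break
        voltage += increment if brightness < target else -increment
        voltage = min(hi, max(lo, voltage))  # stay inside the allowed range
    return voltage


table = calibrate(fake_lamp)
start = closest_voltage(table, 3000)
voltage = fine_tune(fake_lamp, start, 3000)
```

The increment has to be small enough that one step cannot jump across the tolerance band, otherwise the loop just oscillates around the target until max_steps runs out.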
After some more wikipedia browsing I found this:
Control loop feedback mechanism:
previous_error = setpoint - actual_position
integral = 0
start:
  error = setpoint - actual_position
  integral = integral + (error*dt)
  derivative = (error - previous_error)/dt
  output = (Kp*error) + (Ki*integral) + (Kd*derivative)
  previous_error = error
  wait(dt)
  goto start
[edit]
By removing the "integral" component and tweaking the remaining weights (Kp and Kd), the loop works perfectly.
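The pseudocode above translates fairly directly into Python. This is a sketch of the generic loop only: the class name and the gains are made up, and how `output` is turned into a lamp voltage (used directly, or added to the previous voltage) isn't stated in this thread, so that part is left out. Setting ki to 0 gives the variant without the integral term.

```python
class PIDController:
    """Direct translation of the PID pseudocode; one call to update() per time step."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp = kp
        self.ki = ki
        self.kd = kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.previous_error = None  # filled in on the first update

    def update(self, measurement, dt):
        error = self.setpoint - measurement
        if self.previous_error is None:
            # Matches the pseudocode: previous_error starts equal to the first error.
            self.previous_error = error
        self.integral += error * dt
        derivative = (error - self.previous_error) / dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Arbitrary illustrative gains and readings, not tuned for any real lamp.
pid = PIDController(kp=2.0, ki=0.5, kd=1.0, setpoint=10.0)
first = pid.update(4.0, dt=1.0)
second = pid.update(7.0, dt=1.0)
```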
I am not at all into physics, but if you can assume that the relationship between voltage and brightness is somewhat close to linear, you can use a standard binary search.
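A sketch of that binary search in Python. Strictly speaking it only needs brightness to increase with voltage, not to be linear. As before, `fake_lamp` is a hypothetical stand-in for setting the voltage and reading the sensor, and the function names are made up.

```python
def fake_lamp(voltage):
    """Hypothetical lamp + sensor: brightness rises monotonically from 0 to 6000."""
    return 240.0 * (voltage - 2.5) ** 2


def find_voltage(measure, target, lo=2.5, hi=7.5, tolerance=1.0, max_steps=50):
    """Bisect the voltage range until the measured brightness is within tolerance."""
    for _ in range(max_steps):
        mid = (lo + hi) / 2.0
        brightness = measure(mid)
        if abs(brightness - target) <= tolerance:
            return mid
        if brightness < target:
            lo = mid  # too dim: search the upper half
        else:
            hi = mid  # too bright: search the lower half
    return (lo + hi) / 2.0


voltage = find_voltage(fake_lamp, 3000)
```

Each step halves the remaining voltage range, so a few dozen sensor readings are enough even for a tight tolerance.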
Other than that, this reminds me of the inverted pendulum, which is one of the standard examples for the use of fuzzy logic.