I'm working on an NLP sequence labelling problem. My data consists of variable length sequences (w_1, w_2, ..., w_k) with corresponding labels (l_1, l_2, ..., l_k) (in this case the task is named entity extraction).
I intend to solve the problem using Recurrent Neural Networks. As the sequences are of variable length I need to pad them (I want batch size >1). I have the option of either pre zero padding them, or post zero padding them. I.e. either I make every sequence (0, 0, ..., w_1, w_2, ..., w_k) or (w_1, w_2, ..., w_k, 0, 0, ..., 0) such that the lenght of each sequence is the same.
How does the choice between pre- and post padding impact results?
It seems like pre padding is more common, but I can't find an explanation of why it would be better. Due to the nature of RNNs it feels like an arbitrary choice for me, since they share weights across time steps.
Commonly in RNN's, we take the final output or hidden state and use this to make a prediction (or do whatever task we are trying to do).
If we send a bunch of 0's to the RNN before taking the final output (i.e. 'post' padding as you describe), then the hidden state of the network at the final word in the sentence would likely get 'flushed out' to some extent by all the zero inputs that come after this word.
So intuitively, this might be why pre-padding is more popular/effective.
This paper (https://arxiv.org/pdf/1903.07288.pdf) studied the effect of padding types on LSTM and CNN. They found that post-padding achieved substantially lower accuracy (nearly half) compared to pre-padding in LSTMs, although there wasn't a significant difference for CNNs (post-padding was only slightly worse).
A simple/intuitive explanation for RNNs is that, post-padding seems to add noise to what has been learned from the sequence through time, and there aren't more timesteps for the RNN to recover from this noise. With pre-padding, however, the RNN is better able to adjust to the added noise of zeros at the beginning as it learns from the sequence through time.
I think more thorough experiments are needed in the community for more detailed mechanistic explanations on how padding affects performance.
I always recommend using pre-padding over post-padding, even for CNNs, unless the problem specifically requires post-padding.
I would like to understand the general idea behind hybrid modelling (in particular state events) from a numerical point of view (although I am not a mathematician :)). Given the following Modelica model:
model BouncingBall
constant Real g=9.81
Real h(start=1);
Real v(start=0);
equation
der(h)=v;
der(v)=-g;
algorithm
when h < 0 then
reinit(v,-pre(v));
end when;
end BouncingBall;
I understand the concept of when and reinit.
The equation in the when statement are only active when the condition become true right?
Let's assume that the ball would hit the floor at exactly 2sec. Since I am using multi-step solver does that mean that the solver "goes beyond 2 seconds", recognizes that h<0 (lets assume at simulation time = 2.5sec , h = -0.7). What does this mean "The time for the event is searched using a crossing function? Is there a simple explanation(example)?
Is the solver now going back? Taking a smaller step-size?
What does the pre() operation mean in that context?
noEvent(): "Expressions are taken literally instead of generating crossing functions. Since there is no crossing function, there is no requirement tat the expression can be evaluated beyond the event limit": What does that mean? Given the same example with the bouncing ball: The solver detects at time 2.5 that h = 0.7. Whats the difference between with and without noEvent()?
Yes, the body of when is only executed at events.
Simple view: The solver takes steps, and then uses a continuous extension to generate a (smooth) interpolation formula for the previous step. That interpolation formula can be used to generate a plot, and also for finding the first point where h has crossed zero (likely 2.000000001). An event iteration is then done at that interpolated point - and afterwards the solver is restarted.
I wouldn't say that the solver goes back. It takes a partial step and then continues forward. Some solvers need to reduce the step-size a lot after the event - others don't.
pre(x) is set to the value of x before the event.
noEvent(h<0) basically means evaluate the expression as written without all the bells-and-whistles of crossing functions. You cannot use when noEvent(h<0) then
There are many additional point:
If you are familiar with Sturm-sequences or control theory you might realize that it is not necessary to interpolate a formula to determine if it crossed zero or not in an interval (and some tools use that). The fact that the function is not necessarily smooth makes it a bit more complicated, and also means that derivative-tests cannot be used.
How much the solver is reset depends on the kind of solver. One-step solvers (Runge-Kutta) can be restarted directly as if virtually nothing happened, whereas multi-step solvers (BDF/Adams - such as dassl/lsodar/cvode) need to start with lower order and smaller step-size.
Typically, the re-estimation iterative procedure stops when lambda.bar - lambda is less than some epsilon value.
How exactly does one determine this epsilon value? I often only see is written as the general epsilon symbol in papers, and never the actual value used, which I assume would change depending on the data.
So, for instance, if the lambda value of my first iteration was 5*10^-22, second iteration was 1.3*10^-15, third was 8.45*10^-15, fourth was 1.65*10^-14, etc., how would I determine when the algorithm needed no more iteratons?
Moreover, what if I were to apply the same alogrithm to a different datset? would I need to change my epsilon definitions?
Sorry for the long question. Pretty puzzled by it... :)
"how would I determine when the algorithm needed no more iteratons?"
When you get a "good-enough" result within a reasonable amount of time. ;-)
"Moreover, what if I were to apply the same alogrithm to a different datset? would I need to
change my epsilon definitions?"
Yes, most probably.
If you can afford it, you can just let it iterate until the updated value <= the old value (it could be < due to floating point error). I would be inclined to go with this until I ran out of patience or cpu budget.
I am taking a course on models of computation and currently we are doing finite state machines. One my tasks is to draw out a FSM that performs division of 3; to simplify the model the machine only accepts numbers multiple of 3. I am not sure how this exactly works, especially since I imagine FSM putting out only single binary values. Could you guys give examples (division by 2 or 4) or hints on how to approach this?
This is what you need, I think (sorry about the bad picture). The 'E' represents epsilon/lambda/no-output. The label of the edges denotes 'input/output'. For each symbol read there is also a corresponding output which may be lambda (no output).
I was thinking earlier today about an idea for a small game and stumbled upon how to implement it. The idea is that the player can make a series of moves that cause a little effect, but if done in a specific sequence would cause a greater effect. So far so good, this I know how to do. Obviously, I had to make it be more complicated (because we love to make it more complicated), so I thought that there could be more than one possible path for the sequence that would both cause greater effects, albeit different ones. Also, part of some sequences could be the beggining of other sequences, or even whole sequences could be contained by other bigger sequences. Now I don't know for sure the best way to implement this. I had some ideas, though.
1) I could implement a circular n-linked list. But since the list of moves never end, I fear it might cause a stack overflow ™. The idea is that every node would have n children and upon receiving a command, it might lead you to one of his children or, if no children was available to such command, lead you back to the beggining. Upon arrival on any children, a couple of functions would be executed causing the small and big effect. This might, though, lead to a lot of duplicated nodes on the tree to cope up with all the possible sequences ending on that specific move with different effects, which might be a pain to maintain but I am not sure. I never tried something this complex on code, only theoretically. Does this algorithm exist and have a name? Is it a good idea?
2) I could implement a state machine. Then instead of wandering around a linked list, I'd have some giant nested switch that would call functions and update the machine state accordingly. Seems simpler to implement, but... well... doesn't seem fun... nor ellegant. Giant switchs always seem ugly to me, but would this work better?
3) Suggestions? I am good, but I am far inexperienced. The good thing of the coding field is that no matter how weird your problem is, someone solved it in the past, but you must know where to look. Someone might have a better idea than those I had, and I really wanted to hear suggestions.
I'm not absolutely completely sure that I understand exactly what you're saying, but as an analagous situation, say someone's inputting an endless stream of numbers on the keyboard. '117' is a magic sequence, '468' is another one, '411799' is another (which contains the first one).
So if the user enters:
55468411799
you want to fire 'magic events' at the *s:
55468*4117*99*
or something like that, right? If that's analagous to the problem you're talking about, then what about something like (Java-like pseudocode):
MagicSequence fireworks = new MagicSequence(new FireworksAction(), 1, 1, 7);
MagicSequence playMusic = new MagicSequence(new MusicAction(), 4, 6, 8);
MagicSequence fixUserADrink = new MagicSequence(new ManhattanAction(), 4, 1, 1, 7, 9, 9);
Collection<MagicSequence> sequences = ... all of the above ...;
while (true) {
int num = readNumberFromUser();
for (MagicSequence seq : sequences) {
seq.handleNumber(num);
}
}
while MagicSequence has something like:
Action action = ... populated from constructor ...;
int[] sequence = ... populated from constructor ...;
int position = 0;
public void handleNumber(int num) {
if (num == sequence[position]) {
// They've entered the next number in the sequence
position++;
if (position == sequence.length) {
// They've got it all!
action.fire();
position = 0; // Or disable this Sequence from accepting more numbers if it's a once-off
}
} else {
position = 0; // missed a number, start again!
}
}
You might want to implement a state machine anyway, but you don't have to hardcode state transitions.
Try to make a graph of states, where link between state A to state B will mean A can lead to B.
Then you can traverse graph at runtime to find where player goes.
Edit: You can define graph node as:
-state-id
-list of links to other states,
where every link defines:
-state-id
-precondition, a list of states what must be visited before going to this state
What you're describing sounds very similar to the technology tree in a game live Civilization.
I don't know how the Civ authors built theirs, but I'd be inclined to use a multigraph to represent possible 'moves' - there will be some you can start at with no 'experience', and once you're in them, there will be multiple paths through to the end.
Draw-out what potential options you can have at each stage of the game, and then draw lines going from some options to others.
That should give you a start on implementation, as graphs are [relatively] easy concepts to implement and utilize.
Sounds like a neural network. You could create one and train it to recognize the patterns that cause the various effects you are looking for.
What you're describing sounds somewhat similar to a dependency graph or a word graph. You might look into those.
#Cowan, #Javier: Nice idea, mind if I add to it?
Let the MagicSequence objects listen to the incoming stream of user input, that is notify them of the input (broadcast) and let each of them add the input to there internal input stream. This stream is cleared when the input is not the expected next input in the pattern that would have the MagicSequence fire its action. As soon as the pattern is completed, fire the action and clear the internal input stream.
Optimize this by only feeding input to the MagicSequences that are waiting for it. This could be done two ways:
You have an object that lets all MagicSequences connect with events that correspond with numbers in their patterns. MagicSequence(1,1,7) would add itself to got1 and got7, for example:
UserInput.got1 += MagicSequnece[i].SendMeInput;
You could optimize this such that after each input MagicSequences deregister from invalid events and register with valid ones.
create a small state machine for each effect that you'd want. at each user action, 'broadcast' it to all state machines. most of then won't care, but some will advance, or maybe go backwards. when one of them reaches it's goal, produce the desired effect.
to keep the code neat, don't hardcode the state machines, instead build a simple data structure that encodes the state graph: each state is a node with a list of interesting events, each one points to the next state's node. Each machine's state is simply a reference to the appropriate state node.
edit: It seems Cowan's advice is equivalent to this, but he optimises his state machines to express only simple sequences. seems enough for your specific problem, but more complex conditions could need a more general solution.