How does choosing between pre and post zero padding of sequences impact results/performance?

I'm working on an NLP sequence labelling problem. My data consists of variable length sequences (w_1, w_2, ..., w_k) with corresponding labels (l_1, l_2, ..., l_k) (in this case the task is named entity extraction).
I intend to solve the problem using Recurrent Neural Networks. As the sequences are of variable length, I need to pad them (I want batch size > 1). I can either pre-zero-pad them or post-zero-pad them, i.e. make every sequence either (0, 0, ..., w_1, w_2, ..., w_k) or (w_1, w_2, ..., w_k, 0, 0, ..., 0), so that the length of each sequence is the same.
How does the choice between pre- and post-padding impact results?
It seems like pre-padding is more common, but I can't find an explanation of why it would be better. Due to the nature of RNNs it feels like an arbitrary choice to me, since they share weights across time steps.
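For concreteness, here is a minimal sketch of the two options using Keras' pad_sequences (assuming TensorFlow/Keras is available; the toy data is made up):

# Minimal sketch: pre- vs post-padding of integer-encoded sequences.
from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = [[3, 7, 2], [5, 1], [9, 4, 8, 6]]

pre = pad_sequences(seqs, padding="pre")    # [[0 3 7 2], [0 0 5 1], [9 4 8 6]]
post = pad_sequences(seqs, padding="post")  # [[3 7 2 0], [5 1 0 0], [9 4 8 6]]
print(pre)
print(post)

For sequence labelling the label sequences would typically be padded the same way, with the padded positions masked out of the loss.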

Commonly in RNNs, we take the final output or hidden state and use this to make a prediction (or do whatever task we are trying to do).
If we feed a bunch of zeros to the RNN before taking the final output (i.e. 'post'-padding as you describe), then the hidden state of the network at the final word in the sentence would likely get 'flushed out' to some extent by all the zero inputs that come after that word.
So intuitively, this might be why pre-padding is more popular/effective.
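To make the 'flushing' intuition concrete, here is a tiny numpy sketch with arbitrary, untrained weights (purely illustrative, not a real model):

# Toy illustration: trailing zero inputs keep updating the hidden state.
import numpy as np

rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 4))   # input-to-hidden weights
W_h = rng.normal(size=(4, 4))   # hidden-to-hidden weights

def run_rnn(inputs):
    h = np.zeros(4)
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h)
    return h

words = [rng.normal(size=4) for _ in range(3)]   # the "real" sequence
pad = [np.zeros(4)] * 5                          # post-padding

h_last_word = run_rnn(words)        # state right after the last real word
h_after_pad = run_rnn(words + pad)  # state the model actually ends with

print(np.linalg.norm(h_last_word - h_after_pad))  # clearly non-zero

With zero input the update is still h = tanh(W_h · h), so the state keeps drifting away from whatever the last real word left behind.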

This paper (https://arxiv.org/pdf/1903.07288.pdf) studied the effect of padding types on LSTMs and CNNs. They found that post-padding achieved substantially lower accuracy (nearly half) compared to pre-padding in LSTMs, although there wasn't a significant difference for CNNs (post-padding was only slightly worse).
A simple/intuitive explanation for RNNs is that post-padding seems to add noise to what has been learned from the sequence through time, and there are no further timesteps for the RNN to recover from this noise. With pre-padding, however, the RNN is better able to adjust to the added noise of zeros at the beginning as it learns from the rest of the sequence through time.
I think more thorough experiments are needed in the community for more detailed mechanistic explanations on how padding affects performance.
I always recommend using pre-padding over post-padding, even for CNNs, unless the problem specifically requires post-padding.

Related

Ising 2D Optimization

I have implemented a MC-Simulation of the 2D Ising model in C99.
Compiling with gcc 4.8.2 on Scientific Linux 6.5.
When I scale up the grid the simulation time increases, as expected.
The implementation simply uses the Metropolis–Hastings algorithm.
I tried to find a way to speed up the algorithm, but I don't have any good ideas.
Are there any tricks to do so?
As jimifiki wrote, try to do a profiling session.
In order to improve on the algorithmic side only, you could try the following:
Lookup Table:
When calculating the energy difference for the Metropolis criterion you need to evaluate the exponential exp[-K/T * dE], where K is your scaling constant (in units of Boltzmann's constant) and dE is the energy difference between the original state and the one after a spin flip.
Calculating exponentials is expensive.
So you simply build a table beforehand in which to look up the possible values of dE. For a nearest-neighbour interaction there are only a few possible neighbour configurations (choose how many of the four neighbours are aligned); exploit the problem's symmetry and you get five possible values for dE: 8, 4, 0, -4, -8. Instead of calling the exp function, use the precalculated table.
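A minimal sketch of the table in Python (the original code is C99, but the translation is direct; K and T are placeholder values):

# Precompute the Boltzmann factors for the five possible energy differences
# of a 2D nearest-neighbour Ising model, instead of calling exp() per flip.
import math
import random

K, T = 1.0, 2.0                      # placeholder coupling and temperature
boltzmann = {dE: math.exp(-K * dE / T) for dE in (-8, -4, 0, 4, 8)}

def accept(dE):
    # Metropolis acceptance test using the precomputed table.
    return dE <= 0 or random.random() < boltzmann[dE]

In C99 the same idea is a five-element array indexed by (dE + 8) / 4.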
Parallelization:
As mentioned before, it is possible to parallelize the algorithm. To preserve physical correctness, you have to use a so-called checkerboard scheme: view the two-dimensional grid as a checkerboard and update all the white cells in parallel, then all the black ones. This should be clear when you consider the nearest-neighbour interaction, which introduces dependencies between neighbouring values.
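A vectorized sketch of one checkerboard half-sweep (numpy here just to show the idea; the same decomposition applies in C99 or CUDA, and K, T and the RNG are placeholders):

# Update all sites of one checkerboard colour (parity 0 or 1) at once.
import numpy as np

def half_sweep(spins, parity, K=1.0, T=2.0, rng=np.random.default_rng()):
    L = spins.shape[0]
    # Sum of the four nearest neighbours with periodic boundaries.
    nn = (np.roll(spins, 1, 0) + np.roll(spins, -1, 0) +
          np.roll(spins, 1, 1) + np.roll(spins, -1, 1))
    dE = 2.0 * K * spins * nn                      # cost of flipping each spin
    accept = rng.random((L, L)) < np.exp(-dE / T)  # covers dE <= 0 automatically
    ii, jj = np.indices((L, L))
    colour = (ii + jj) % 2 == parity               # touch only one colour
    spins[colour & accept] *= -1
    return spins

Running parity 0 and then parity 1 gives one full sweep; sites of the same colour never neighbour each other, so updating them simultaneously is safe.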
Use GPGPU:
You can also implement the simulation on a GPGPU, e.g. using CUDA, if you're already working in C99.
Some tips:
- Don't forget to align your C99 structs properly.
- Use flat (linear) arrays, not nested ones. Properly aligned, contiguous memory is normally faster to access.
- Try to let the compiler do loop unrolling, etc. (gcc has specific options for this; they are not enabled by default at -O2).
Some more information:
If you are looking for an efficient way to determine the critical point of the system, the method of choice is finite-size scaling: simulate at several system sizes and temperatures, then compute a quantity that is system-size independent at the critical point, so that the curves for the different sizes intersect there (please see the theory for a detailed explanation).
I hope I was helpful.
Cheers...
It's normal that your simulation time scales at least with the square of the linear grid size (i.e. with the number of spins), isn't it?
Here are some suggestions:
If you are concerned with thermalization issues, try to use parallel tempering. It can be of help.
The Metropolis-Hastings algorithm can be made parallel. You could try to do it.
Check you are not pessimizing the code.
Are your spin arrays arrays of ints? You could pack many spins into the same int. It's a lot of work.
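A minimal illustration of that bit-packing idea (Python for brevity; in C you would do the same with an unsigned integer and bit masks):

# Pack spins into the bits of one integer: bit i set means spin +1, clear means -1.
def get_spin(word, i):
    return 1 if (word >> i) & 1 else -1

def flip_spin(word, i):
    return word ^ (1 << i)           # flip spin i with a single XOR

row = 0b101101                       # six spins packed into one integer
print([get_spin(row, i) for i in range(6)])   # [1, -1, 1, 1, -1, 1]
print(bin(flip_spin(row, 2)))                 # bit 2 toggled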
Moreover, remember what Donald Knuth taught us:
premature optimisation is the root of all evil
Before optimising you should first understand where your program is slow. This is called profiling.

How to fuzzily search for a dictionary word?

I have read a lot of threads here discussing edit-distance based fuzzy-searches, which tools like Elasticsearch/Lucene provide out of the box, but my problem is a bit different. Suppose I have a dictionary of words, {'cat', 'cot', 'catalyst'}, and a character similarity relation f(x, y)
f(x, y) = 1, if characters x and y are similar
= 0, otherwise
(These "similarities" can be specified by the programmer)
such that, say,
f('t', 'l') = 1
f('a', 'o') = 1
f('f', 't') = 1
but,
f('a', 'z') = 0
etc.
Now if we have a query 'cofatyst', the algorithm should report the following matches:
('cot', 0)
('cat', 0)
('catalyst', 0)
where the number is the 0-based starting index of the match found. I have tried the Aho-Corasick algorithm, and while it works great for exact matching and when a character has relatively few "similar" characters, its performance drops exponentially as the number of similar characters per character increases. Can anyone point me to a better way of doing this? Fuzziness is an absolute necessity, and it must take into account character similarities (i.e., not blindly depend on edit distances alone).
One thing to note is that in the wild, the dictionary is going to be really large.
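To pin down the required behaviour, here is a brute-force reference matcher (far too slow for a large dictionary; it only serves to make the expected output precise, with the similarity pairs taken from the example above):

# Brute-force reference implementation of the matching semantics described above.
similar = {('t', 'l'), ('a', 'o'), ('f', 't')}

def f(x, y):
    return 1 if x == y or (x, y) in similar or (y, x) in similar else 0

def matches(query, dictionary):
    out = []
    for word in dictionary:
        for start in range(len(query) - len(word) + 1):
            if all(f(q, w) for q, w in zip(query[start:], word)):
                out.append((word, start))
    return out

print(matches('cofatyst', ['cat', 'cot', 'catalyst']))
# [('cat', 0), ('cot', 0), ('catalyst', 0)]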
I might try cosine similarity, using the position of each character as a feature and computing the product between features with a match function based on your character relations.
Not very specific advice, I know, but I hope it helps.
edited: Expanded answer.
With cosine similarity you compute how similar two vectors are. In your case the normalisation might not make sense. So what I would do is something very simple (I might be oversimplifying the problem): first, view the C×C matrix as a dependency matrix holding the probability that two characters are related (e.g., P('t' | 'l') = 1). This also lets you express partial dependencies, to differentiate between perfect and partial matches. After that, I would compute, for each position, the probability that the letters from the two words are not the same (using the complement of P(t_i, t_j)), and then aggregate the results with a sum.
This counts the number of positions that differ for a given pair of words, and it allows you to define partial dependencies. Furthermore, the implementation is very simple and should scale well. This is why I am not sure whether I misunderstood your question.
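A rough sketch of what I mean (the probability table here is a made-up example, with 0.5 standing for a partial dependency):

# Score = sum over positions of (1 - P(a_i, b_i)), i.e. the expected number of
# mismatching positions. The table is illustrative only.
P = {('t', 'l'): 1.0, ('a', 'o'): 1.0, ('f', 't'): 0.5}

def p(x, y):
    if x == y:
        return 1.0
    return P.get((x, y), P.get((y, x), 0.0))

def mismatch_score(a, b):
    return sum(1.0 - p(x, y) for x, y in zip(a, b))

print(mismatch_score('cot', 'cof'))   # 0.5: 't' vs 'f' is only a partial match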
I am using the Fuse JavaScript library for a project of mine. It is a JavaScript file which works on a JSON dataset, and it is quite fast. Have a look at it.
It implements the full Bitap algorithm, leveraging a modified version of Google's Diff, Match & Patch tool (from its site).
The code is simple, and it is easy to see how the algorithm is implemented.

How does a finite state machine perform division?

I am taking a course on models of computation and currently we are doing finite state machines. One of my tasks is to draw an FSM that performs division by 3; to simplify the model, the machine only accepts numbers that are multiples of 3. I am not sure how exactly this works, especially since I imagine an FSM putting out only single binary values. Could you give examples (division by 2 or 4) or hints on how to approach this?
This is what you need, I think (sorry about the bad picture). The 'E' represents epsilon/lambda/no output. The labels on the edges denote 'input/output': for each symbol read there is a corresponding output, which may be lambda (no output).
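If the picture is hard to read, the same machine can be written down as a transition table. Here is a sketch (my own encoding, reading the dividend in binary, most significant bit first): the state is the running remainder, and each edge outputs one bit of the quotient, which is exactly binary long division by 3.

# Mealy machine that divides a binary number (read MSB-first) by 3.
# transition[(state, input_bit)] = (next_state, output_bit)
transition = {
    (0, 0): (0, 0), (0, 1): (1, 0),
    (1, 0): (2, 0), (1, 1): (0, 1),
    (2, 0): (1, 1), (2, 1): (2, 1),
}

def divide_by_3(bits):
    state, out = 0, []
    for b in bits:
        state, q = transition[(state, b)]
        out.append(q)
    assert state == 0, "input was not a multiple of 3"
    return out

print(divide_by_3([1, 0, 0, 1]))   # 9 in binary -> [0, 0, 1, 1], i.e. 3

A division-by-2 machine works the same way with two states (remainder 0 or 1).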

Algorithms to represent a set of integers with only one integer

This may not be a programming question, but it's a problem that arose recently at work. Some background: a big C development effort with a special interest in performance.
I have a set of integers and want to test the membership of another given integer. I would love to implement an algorithm that can check this with a minimal set of algebraic operations, using only a single integer to represent the whole set of integers contained in the first set.
I've tried a composite Cantor pairing function, for instance, but with a 30-element set it seems too complicated, and with the focus on performance it makes no sense. I played with some operations, like XORing and negating, but they only give me weak estimates of membership. Then I tried successions of additions and finally got lost.
Any ideas?
For sets of unsigned long of size 30, the following is one fairly obvious way to do it:
store each set as a sorted array, 30 * sizeof(unsigned long) bytes per set.
to look up an integer, do a few steps of a binary search, followed by a linear search (profile in order to figure out how many steps of binary search are best - my wild guess is 2 steps, but you might find out different; and of course if you test bsearch and it's fast enough, you can just use it).
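For illustration, the lookup in Python (the question is about C, where this would be bsearch or a short hand-written loop, but the structure is identical):

# Membership test against a small sorted array, as described above.
import bisect

def contains(sorted_set, x):
    i = bisect.bisect_left(sorted_set, x)
    return i < len(sorted_set) and sorted_set[i] == x

s = sorted([17, 3, 99, 42, 7])           # stored once; 30 elements in the real case
print(contains(s, 42), contains(s, 5))   # True False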
So the next question is why you want a big-maths solution, which will tell me what's wrong with this solution other than "it is insufficiently pleasing".
I suspect that any big-math solution will be slower than this. A single arithmetic operation on an N-digit number takes at least linear time in N. A single number to represent a set can't be very much smaller than the elements of the set laid end to end with a separator in between. So even a linear search in the set is about as fast as a single arithmetic operation on a big number. With the possible exception of a Goedel representation, which could do it in one division once you've found the nth prime number, any clever mathematical representation of sets is going to take multiple arithmetic operations to establish membership.
Note also that there are two different reasons you might care about the performance of "look up an integer in a set":
You are looking up lots of different integers in a single set, in which case you might be able to go faster by constructing a custom lookup function for that data. Of course in C that means you need either (a) a simple virtual machine to execute that "function", or (b) runtime code generation, or (c) to know the set at compile time. None of which is necessarily easy.
You are looking up the same integer in lots of different sets (to get a sequence of all the sets it belongs to), in which case you might benefit from a combined representation of all the sets you care about, rather than considering each set separately.
I suppose that very occasionally, you might be looking up lots of different integers, each in a different set, and so neither of the reasons applies. If this is one of them, you can ignore that stuff.
One good start is to try Bloom filters.
Basically, a Bloom filter is a probabilistic data structure that gives you no false negatives, but some false positives. So when an integer matches the Bloom filter, you then have to check whether it really is in the set, but it's a big speedup because it greatly reduces the number of sets you have to check.
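A toy Bloom filter to show the mechanics (the hash construction and sizes are arbitrary here; in practice you would size the bit array from the expected element count and target false-positive rate):

# Toy Bloom filter: K hash functions set/check K bits in an M-bit array.
import hashlib

M, K = 1024, 3   # bit-array size and number of hashes (arbitrary for the demo)

def _bits(x):
    for i in range(K):
        h = hashlib.sha256(f"{i}:{x}".encode()).digest()
        yield int.from_bytes(h[:8], "big") % M

def add(bits, x):
    for b in _bits(x):
        bits[b] = 1

def might_contain(bits, x):
    # No false negatives; a True answer still needs confirming against the real set.
    return all(bits[b] for b in _bits(x))

bits = [0] * M
for v in (3, 17, 42):
    add(bits, v)
print(might_contain(bits, 42), might_contain(bits, 5))   # True, (almost surely) False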
If I understood you correctly, a Python example:
>>> a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
>>> len_a = len(a)
>>> b = [1]
>>> if len(set(a) - set(b)) < len_a:
...     print('this integer exists in set')
...
this integer exists in set
math base: http://en.wikipedia.org/wiki/Euler_diagram

Will this obfuscation algorithm for a URL shortener work?

DISCLAIMER: I am not asking how to make a URL shortener (I have already implemented the "bijective function" answer found HERE that uses a base-62 encoded string). Instead, I want to expand this implementation to obfuscate the generated string so that it is both:
A) not an easily guessable sequence, and
B) still bijective.
You can easily randomize your base-62 character set, but the problem is that it still increments like any other number in any other base. For example, one possible incremental progression might be {aX9fgE, aX9fg3, aX9fgf, aX9fgR, …}.
I have come up with an obfuscation technique that I am pleased with in terms of requirement A), but I'm only partially sure that it satisfies B). The idea is this:
The only thing that is guaranteed to change in the incremental approach is the "1's place" (I'll use decimal terminology for practicality reasons). In the sample progression I gave earlier, that would be {E, 3, f, R, …}. So if each character in the base-62 set had its own unique offset number (say, its distance from the "zero character"), then you could apply the offset of the "1's place" character to the rest of the string.
For instance, let's assume a base-6 set with characters {A, f, 9, p, Z, 3} (in ascending order from 0 to 5). Each one would then have a unique offset of 0 to 5, respectively. Counting would look like {A, f, 9, p, Z, 3, fA, ff, f9, fp, …} and so on. So the algorithm, when given a value of fZ3p, would look at the p and, it having an offset of +3, would permute the string into Zf9p (treating the 6-character set as a circular array). The next incremental number would be fZ3Z, and with Z's offset being +4, the algorithm returns 39pZ. These permuted results would be handed off to the user as his/her "unique URL", who would never see the actual base-62 encoded string.
This approach certainly seems reversible; just look at the last character, and perform the same permutation with the negative offset. And I'm thinking that for this reason, it has to still be bijective. But I don't know if this is necessarily true? Are there any edge/corner cases I'm not considering?
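Here is the scheme in code, with the toy alphabet from above, mainly to make the permutation and its inverse concrete:

# Sketch of the offset permutation described above: the last character's offset
# circularly shifts every other character; decoding shifts them back.
ALPHABET = "Af9pZ3"
OFFSET = {c: i for i, c in enumerate(ALPHABET)}
BASE = len(ALPHABET)

def shift(s, d):
    body = "".join(ALPHABET[(OFFSET[c] + d) % BASE] for c in s[:-1])
    return body + s[-1]

def encode(s):
    return shift(s, OFFSET[s[-1]])

def decode(s):
    return shift(s, -OFFSET[s[-1]])

print(encode("fZ3p"), encode("fZ3Z"))   # Zf9p 39pZ, matching the examples above
assert all(decode(encode(w)) == w for w in ("fZ3p", "fZ3Z", "A", "3A9"))

The round-trip assert at least supports the reversibility argument above for these examples.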
EDIT: My intentions are weighted more heavily towards the length of the shortened URL than towards the security of the pattern. I realize there are plenty of solutions involving cryptographic functions, block ciphers, etc. But I would like to emphasize that I am not asking the best way to achieve A), but rather, "is my offset approach satisfying B)".
Any holes you can find would be appreciated.
If you honestly want them to be hard to guess, keep it simple.
Start with a normal encryption algorithm running in counter mode. When you get a URL to shorten, increment your counter, encrypt it, convert the result to something using printable characters (e.g., base 64) and put the original URL and the shortened version into your table so you can get the original URL from the shortened version when needed.
The only real question at that point is what encryption algorithm to use. That, in turn, depends on your threat model. I don't see exactly what you gain by making shortened URLs hard to guess, so I'm a bit uncertain about the threat model.
If you want to make it mildly difficult to guess, you could use something like a 40-bit version of RC4. This is pretty easy to break, but enough to keep most people from bothering.
If you want a bit more security, you could step up to DES. That's been broken, but even at this late date breaking it is quite a bit of work.
If you want more security than that, you can use AES.
Note that as you increase the security, the shortened URL gets longer. RC4-40 starts out with 5 bytes, DES with 7, and AES with 32. Depending on how you convert to printable text, that's going to expand at least a little.
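For illustration, a sketch of the encrypt-a-counter idea with AES via the cryptography package (recent versions; the key is a placeholder and key management and the lookup table are omitted). Applying the block cipher to the counter block directly is essentially what CTR mode does internally to produce its keystream; with AES the output is one 16-byte block, about 22 characters once base64-encoded:

# Map an incrementing counter to an unguessable, reversible token.
import base64
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

KEY = b"\x00" * 16   # placeholder key; use a real secret in practice

def shorten(counter):
    enc = Cipher(algorithms.AES(KEY), modes.ECB()).encryptor()
    token = enc.update(counter.to_bytes(16, "big")) + enc.finalize()
    return base64.urlsafe_b64encode(token).rstrip(b"=").decode()

def unshorten(token):
    raw = base64.urlsafe_b64decode(token + "=" * (-len(token) % 4))
    dec = Cipher(algorithms.AES(KEY), modes.ECB()).decryptor()
    return int.from_bytes(dec.update(raw) + dec.finalize(), "big")

t = shorten(12345)
print(t, unshorten(t))   # some 22-character token, then 12345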
Another option is to use the Luby-Rackoff construction (see also here), which is a way to generate a pseudo-random permutation from a pseudo-random function.
You just have to pick a "round function" F. F must take as input a key K and a block of bits half the size of what you are encoding. F must produce as output a block of bits also half the size of whatever you are encoding.
Then you just run the Luby-Rackoff construction (aka. "Feistel network") for four rounds, each round using a different K.
The construction guarantees that the result is a bijective map, and it will be hard to invert provided that F is hard to invert.
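A sketch of a 4-round Feistel/Luby-Rackoff permutation over 64-bit values, with HMAC-SHA256 standing in for the round function F (the round keys and sizes here are illustrative choices, not part of the construction itself):

# 4-round Feistel network over 64-bit inputs: split into two 32-bit halves and
# iterate L, R = R, L XOR F(K_i, R). Each round is invertible, so the whole map
# is a bijection on 64-bit values no matter what F is.
import hashlib
import hmac

ROUND_KEYS = [b"k1", b"k2", b"k3", b"k4"]   # placeholder round keys

def F(key, half):
    digest = hmac.new(key, half.to_bytes(4, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")          # 32-bit PRF output

def feistel_encrypt(x):
    L, R = x >> 32, x & 0xFFFFFFFF
    for k in ROUND_KEYS:
        L, R = R, L ^ F(k, R)
    return (L << 32) | R

def feistel_decrypt(y):
    L, R = y >> 32, y & 0xFFFFFFFF
    for k in reversed(ROUND_KEYS):
        L, R = R ^ F(k, L), L
    return (L << 32) | R

n = 123456789
assert feistel_decrypt(feistel_encrypt(n)) == n

The permuted counter can then be run through the usual base-62 encoding to get the short string.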
I tried to solve the same problem (in PHP) and ended up with these functions:
hashing the row id with a kind of Feistel algorithm
applying a bijective function to compress the integer
So for A): it's not easily guessable (to me), as you can't increment a string to get the next record without the algorithm.
And for B): from what I understand, it's 100% bijective.
Thanks to @Nemo for naming the Feistel network, which led me to the first function I linked to.
If you're trying to avoid people crawling the URLs, I think Nick Johnson has the right idea, that you need to make sure your URL space is not dense.
Here's a simple idea: take your URL, and prepend a few random characters to it. Then run it through a compression algorithm -- I'd try range encoding (you can probably specify the basis if you find a good library). This should be decompressible to the original form, and should both impact locality and make the encoded space more sparse.
That said, I imagine nearly all URL shorteners out there keep a hash table with state on the server side. How else are you going to losslessly compress a hundred-character URL into 5 or 6 characters?
