How does Gensim implement subsampling in Word2Vec?

I am trying to reimplement word2vec in PyTorch. I implemented subsampling according to the code of the original paper. However, I am trying to understand how subsampling is implemented in Gensim. I looked at the source code, but I did not manage to grasp how it connects back to the original paper.
Thanks a lot in advance.

The key line is:
https://github.com/RaRe-Technologies/gensim/blob/e391f0c25599c751e127dde925e062c7132e4737/gensim/models/word2vec_inner.pyx#L543
if c.sample and word.sample_int < random_int32(&c.next_random):
    continue
The if c.sample tests whether frequent-word downsampling is enabled at all (any non-zero value).
The word.sample_int is a value, per vocabulary word, that was precalculated during the vocabulary-discovery phase. It's essentially the 0.0-to-1.0 probability that a word should be kept, but scaled to the range 0-to-(2^32-1).
Most words, which are never down-sampled, simply have the value (2^32-1) there - so no matter what random int was just generated, that random int is less than the threshold, and the word is retained.
The few most-frequent words have other scaled values there, and thus sometimes the random int generated is larger than their sample_int. Thus, that word is, in that one training-cycle, skipped via the continue to the next word in the sentence. (That one word doesn't get made part of effective_words, this one time.)
You can see the original assignment & precalculation of the .sample_int values, per unique vocabulary word, at and around:
https://github.com/RaRe-Technologies/gensim/blob/e391f0c25599c751e127dde925e062c7132e4737/gensim/models/word2vec.py#L1544
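For reference, here is a rough sketch of that precalculation (my own simplification, not gensim's exact code), using the keep-probability formula from the original word2vec C code; the function name and the counts dict are mine:

from math import sqrt

def precalc_sample_ints(counts, sample=1e-3):
    # counts: word -> raw frequency; sample: the same `sample` parameter gensim exposes
    total = sum(counts.values())
    threshold = sample * total
    sample_ints = {}
    for word, f in counts.items():
        # keep-probability from the original C code: (sqrt(f/t) + 1) * (t/f), t = sample * total
        p_keep = (sqrt(f / threshold) + 1) * (threshold / f)
        # infrequent words get p_keep >= 1.0, i.e. the maximum value described above
        sample_ints[word] = int(round(min(1.0, p_keep) * (2**32 - 1)))
    return sample_ints

Only words whose frequency is a few times the sample threshold end up with a sample_int below the maximum, which matches the behavior described above.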

Related

How does choosing between pre and post zero padding of sequences impact results

I'm working on an NLP sequence labelling problem. My data consists of variable length sequences (w_1, w_2, ..., w_k) with corresponding labels (l_1, l_2, ..., l_k) (in this case the task is named entity extraction).
I intend to solve the problem using recurrent neural networks. As the sequences are of variable length I need to pad them (I want batch size > 1). I have the option of either pre-zero-padding them or post-zero-padding them, i.e. I make every sequence either (0, 0, ..., w_1, w_2, ..., w_k) or (w_1, w_2, ..., w_k, 0, 0, ..., 0) such that the length of each sequence is the same.
How does the choice between pre- and post padding impact results?
It seems like pre-padding is more common, but I can't find an explanation of why it would be better. Due to the nature of RNNs it feels like an arbitrary choice to me, since they share weights across time steps.
Commonly in RNNs, we take the final output or hidden state and use this to make a prediction (or do whatever task we are trying to do).
If we send a bunch of 0's to the RNN before taking the final output (i.e. 'post' padding as you describe), then the hidden state of the network at the final word in the sentence would likely get 'flushed out' to some extent by all the zero inputs that come after this word.
So intuitively, this might be why pre-padding is more popular/effective.
This paper (https://arxiv.org/pdf/1903.07288.pdf) studied the effect of padding types on LSTM and CNN. They found that post-padding achieved substantially lower accuracy (nearly half) compared to pre-padding in LSTMs, although there wasn't a significant difference for CNNs (post-padding was only slightly worse).
A simple/intuitive explanation for RNNs is that post-padding seems to add noise to what has been learned from the sequence through time, and there are no more timesteps for the RNN to recover from this noise. With pre-padding, however, the RNN is better able to adjust to the added noise of zeros at the beginning as it learns from the sequence through time.
I think more thorough experiments are needed in the community for more detailed mechanistic explanations on how padding affects performance.
I always recommend using pre-padding over post-padding, even for CNNs, unless the problem specifically requires post-padding.
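For concreteness, here is a minimal sketch of the two options in plain Python (for what it's worth, Keras' pad_sequences defaults to padding='pre'):

def pad(seq, length, pre=True):
    # Zero-pad a sequence out to a fixed length, either in front or behind.
    zeros = [0] * (length - len(seq))
    return zeros + seq if pre else seq + zeros

pad([3, 7, 9], 6)             # [0, 0, 0, 3, 7, 9]  pre-padding
pad([3, 7, 9], 6, pre=False)  # [3, 7, 9, 0, 0, 0]  post-padding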

I need help optimizing this compression algorithm I came up with on my own

I tried coming up with a compression algorithm. I know a little bit about compression theory, so I am aware that this scheme I have come up with could very well never achieve compression at all.
Currently it works only for a string with no consecutive repeating letters/digits/symbols. Once properly established I hope to extrapolate it to binary data etc. But first the algorithm:
Assuming there are only 4 letters: a, b, c, d; we create a matrix/array corresponding to the letters. Whenever a letter is encountered, the corresponding index is incremented so that the index of the last letter encountered is always largest. We increment an index by 2 if it was originally zero. If it was not originally zero, then we set it to its old value plus 2 plus the current largest element (which then becomes the second largest). An example to clarify:
Array = [a,b,c,d]
Initial state = [0,0,0,0]
Letter = a
New state = [2,0,0,0]
Letter = b
New state = [2,4,0,0]
Letter = c
New state = [2,4,6,0]
Letter = d
New state = [2,4,6,8]
Letter = a
New state = [12,4,6,8]
//Explanation for the above state: 12, because largest - second largest - 2 = old value (12 - 8 - 2 = 2)
Letter = d
New state = [12,4,6,22]
and so on...
Decompression is just this logic in reverse.
A rudimentary implementation of compression (in python):
(This function is very rudimentary so not the best kind of code...I know. I can optimize it once I get the core algorithm correct.)
def compress(text):
    matrix = [0] * 95  # we are concerned with the 95 printable chars for now
    for ch in text:
        largest = max(matrix)   # current largest counter
        idx = ord(ch) - 32      # map the printable char to an index 0..94
        if matrix[idx] == 0:
            matrix[idx] = largest + 2
        else:
            matrix[idx] = largest + matrix[idx] + 2
    return matrix
The returned matrix is then used for decompression. Now comes the tricky part:
I can't really call this compression at all, because each number in the matrix generated by the function is of the order of 10**200 for a string of length 50000 (see the quick check after the list below). So storing the matrix actually takes more space than storing the original string. I know... totally useless. But I had hoped, prior to doing all this, that I could use the mathematical properties of a matrix to effectively represent it in some kind of mathematical shorthand. I have tried many possibilities and failed. Some things that I tried:
Rank of the matrix. Failed because not unique.
Denote using the mod function. Failed because either the quotient or the remainder ends up being as large as the original numbers.
Store each integer as a generator using pickle.
Store the matrix as a bitmap file but then the integers are too large to be able to store as color codes.
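To see the blow-up concretely, you can run the compress() function above on a long random string and inspect the largest counter (a quick sketch; exact numbers will vary with the input):

import random
import string

text = "".join(random.choice(string.ascii_lowercase) for _ in range(50000))
result = compress(text)
print(max(result).bit_length())  # bits needed just for the largest counter
print(len(text) * 8)             # bits in the original string, for comparison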
Let me reiterate that the algorithm could be optimized, e.g. instead of adding 2 we could add 1 and proceed. But that doesn't really result in any compression. Same for the code. Minor optimizations can come later... first I want to improve the main algorithm.
Furthermore, it is very likely that this product of an idle mind will never achieve compression after all. In that case, I would like your help and ideas on what it could be useful for.
TL;DR: Check coded parts which depict a compression algorithm. The compressed result is longer than the original string. Can this be fixed? If yes, how?
PS: I have the entire code on my PC. Will create a repo on github and upload in some time.
Compression is essentially a predictive process. Look for patterns in the input and use them to encode the more likely next character(s) more efficiently than the less likely. I can't see anything in your algorithm that tries to build a predictive model.
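To make that concrete, here is a small sketch (my own, not part of the answer) that computes the order-0 Shannon entropy of a string, i.e. the size limit for any coder that models only individual symbol frequencies and no other patterns:

from collections import Counter
from math import log2

def entropy_bound_bits(text):
    # Number of symbols times the per-symbol entropy: sum of -p * log2(p).
    n = len(text)
    counts = Counter(text)
    return n * -sum((c / n) * log2(c / n) for c in counts.values())

An order-0 coder such as Huffman or arithmetic coding approaches this bound; doing better requires modeling context, which is the predictive part the answer is pointing at.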

Will this obfuscation algorithm for a URL shortener work?

DISCLAIMER: I am not asking how to make a URL shortener (I have already implemented the "bijective function" answer found HERE that uses a base-62 encoded string). Instead, I want to expand this implementation to obfuscate the generated string so that it is both:
A) not an easily guessable sequence, and
B) still bijective.
You can easily randomize your base-62 character set, but the problem is that it still increments like any other number in any other base. For example, one possible incremental progression might be {aX9fgE, aX9fg3, aX9fgf, aX9fgR, …}
I have come up with an obfuscation technique that I am pleased with in terms of requirement A), but I'm only partially sure that it satisfies B). The idea is this:
The only thing that is guaranteed to change in the incremental approach is the "1's place" (I'll use decimal terminology for practicality reasons). In the sample progression I gave earlier, that would be {E, 3, f, R, …}. So if each character in the base-62 set had its own unique offset number (say, its distance from the "zero character"), then you could apply the offset of the "1's place" character to the rest of the string.
For instance, let's assume a base-6 set with characters {A, f, 9, p, Z, 3} (in ascending order from 0 to 5). Each one would then have a unique offset of 0 to 5 respectively. Counting would look like {A, f, 9, p, Z, 3, fA, ff, f9, fp, …} and so on. So the algorithm, when given a value of fZ3p, would look at the p and, it having an offset of +3, would permute the string into Zf9p (assuming the base-6 set is a circular array). The next incremental number would be fZ3Z, and with Z's offset being +4, the algorithm returns 39pZ. These permuted results would be handed off to the user as his/her "unique URL", who would never see the actual base-62 encoded string.
This approach certainly seems reversible; just look at the last character, and perform the same permutation with the negative offset. And I'm thinking that for this reason, it has to still be bijective. But I don't know if this is necessarily true? Are there any edge/corner cases I'm not considering?
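Here is a minimal sketch of the scheme exactly as described, using the question's example base-6 set (function names are mine). Since the last character is left untouched and fully determines the shift, the map is invertible, hence bijective:

ALPHABET = "Af9pZ3"                      # the example set, in ascending order
OFFSET = {c: i for i, c in enumerate(ALPHABET)}

def obfuscate(s):
    # Shift every character except the last by the last character's offset.
    k = OFFSET[s[-1]]
    return "".join(ALPHABET[(OFFSET[c] + k) % len(ALPHABET)] for c in s[:-1]) + s[-1]

def deobfuscate(s):
    # Undo the shift using the (unchanged) last character.
    k = OFFSET[s[-1]]
    return "".join(ALPHABET[(OFFSET[c] - k) % len(ALPHABET)] for c in s[:-1]) + s[-1]

assert obfuscate("fZ3p") == "Zf9p"       # matches the worked example
assert deobfuscate("39pZ") == "fZ3Z"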
EDIT : My intentions are more heavily weighed towards the length of the shortened-URL rather than the security of the pattern. I realize there are plenty of solutions involving cryptographic functions, block ciphers, etc. But I would like to emphasize that I am not asking the best way to achieve A), but rather, "is my offset-approach satisfying B)".
Any holes you can find would be appreciated.
If you honestly want them to be hard to guess, keep it simple.
Start with a normal encryption algorithm running in counter mode. When you get a URL to shorten, increment your counter, encrypt it, convert the result to something using printable characters (e.g., base 64) and put the original URL and the shortened version into your table so you can get the original URL from the shortened version when needed.
The only real question at that point is what encryption algorithm to use. That, in turn, depends on your threat model. I don't see exactly what you gain by making shortened URLs hard to guess, so I'm a bit uncertain about the threat model.
If you want to make it mildly difficult to guess, you could use something like a 40-bit version of RC4. This is pretty easy to break, but enough to keep most people from bothering.
If you want a bit more security, you could step up to DES. That's been broken, but even at this late date breaking it is quite a bit of work.
If you want more security than that, you can use AES.
Note that as you increase the security, the shortened URL gets longer though. RC4-40 uses a 5-byte key, DES a 7-byte key, and AES up to a 32-byte key; the ciphertext itself is at least one block (8 bytes for DES, 16 for AES), and depending on how you convert to printable text, that's going to expand at least a little.
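A rough sketch of the counter-mode idea, using the pyca/cryptography package and AES (encrypting the counter block directly, which is exactly one block of CTR keystream); the key handling here is a placeholder, not production advice:

import base64
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

KEY = os.urandom(16)  # placeholder; in practice load a persistent secret key

def shorten(counter: int) -> str:
    block = counter.to_bytes(16, "big")          # the counter as one AES block
    enc = Cipher(algorithms.AES(KEY), modes.ECB()).encryptor()
    token = enc.update(block) + enc.finalize()   # pseudo-random permutation of the counter
    return base64.urlsafe_b64encode(token).rstrip(b"=").decode()  # 22 printable chars

You would then store (token, original URL) in your table as described, so the original URL can be looked up from the shortened version.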
Another option is to use the Luby-Rackoff construction, which is a way to generate a pseudo-random permutation from a pseudo-random function.
You just have to pick a "round function" F. F must take as input a key K and a block of bits half the size of what you are encoding. F must produce as output a block of bits also half the size of whatever you are encoding.
Then you just run the Luby-Rackoff construction (aka. "Feistel network") for four rounds, each round using a different K.
The construction guarantees that the result is a bijective map, and it will be hard to invert provided that F is hard to invert.
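A minimal sketch of the four-round construction over 64-bit values, with a hash-based round function standing in for F (the choice of SHA-256 and the 64-bit width are mine):

import hashlib

MASK = (1 << 32) - 1

def F(key: bytes, half: int) -> int:
    # Round function: keyed hash of one half, truncated to 32 bits.
    digest = hashlib.sha256(key + half.to_bytes(4, "big")).digest()
    return int.from_bytes(digest[:4], "big")

def feistel_encrypt(x: int, keys) -> int:
    L, R = x >> 32, x & MASK
    for k in keys:                  # one key per round; four rounds for Luby-Rackoff
        L, R = R, L ^ F(k, R)
    return (L << 32) | R

def feistel_decrypt(x: int, keys) -> int:
    L, R = x >> 32, x & MASK
    for k in reversed(keys):        # undo the rounds in reverse order
        L, R = R ^ F(k, L), L
    return (L << 32) | R

Because each round only XORs one half with a function of the other half, every round is invertible no matter what F is, so the whole map is bijective on 64-bit values by construction.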
I tried to solve the same problem (in PHP) and ended up with these functions:
hashing the row id with a kind of Feistel algorithm
applying a bijective function to compress the integer
So for A): it's not easily guessable (to me), as you can't increment a string to get the next record without the algorithm.
And for B): from what I understand, it's 100% bijective.
Thanks to Nemo for naming the Feistel network, which led me to the first function I linked to.
If you're trying to avoid people crawling the URLs, I think Nick Johnson has the right idea, that you need to make sure your URL space is not dense.
Here's a simple idea: take your URL, and prepend a few random characters to it. Then run it through a compression algorithm -- I'd try range encoding (you can probably specify the basis if you find a good library). This should be decompressible to the original form, and should both impact locality and make the encoded space more sparse.
That said, I imagine nearly all URL shorteners out there keep a hash table with state on the server side. How else are you going to losslessly compress a hundred-character URL into 5 or 6 characters?

A function where small changes in input always result in large changes in output

I would like an algorithm for a function that takes n integers and returns one integer. For small changes in the input, the resulting integer should vary greatly. Even though I've taken a number of courses in math, I have not used that knowledge very much and now I need some help...
An important property of this function should be that if it is used with coordinate pairs as input and the result is plotted (as a grayscale value for example) on an image, any repeating patterns should only be visible if the image is very big.
I have experimented with various algorithms for pseudo-random numbers with little success and finally it struck me that md5 almost meets my criteria, except that it is not for numbers (at least not from what I know). That resulted in something like this Python prototype (for n = 2, it could easily be changed to take a list of integers of course):
import hashlib

def uniqnum(x, y):
    key = (str(x) + ',' + str(y)).encode()  # hashlib needs bytes, not str
    return int(hashlib.md5(key).hexdigest()[-6:], 16)
But obviously it feels wrong to go over strings when both input and output are integers. What would be a good replacement for this implementation (in pseudo-code, python, or whatever language)?
A "hash" is the solution created to solve exactly the problem you are describing. See wikipedia's article
Any hash function you use will be nice; hash functions tend to be judged based on these criteria:
The degree to which they prevent collisions (two separate inputs producing the same output) -- a by-product of this is the degree to which the function minimizes outputs that may never be reached from any input.
The uniformity of the distribution of its outputs, given a uniformly distributed set of inputs
The degree to which small changes in the input create large changes in the output.
(see perfect hash function)
Given how hard it is to create a hash function that maximizes all of these criteria, why not just use one of the most commonly used and relied-on existing hash functions there already are?
Turning integers into strings almost seems like another layer of encryption! (which is good for your purposes, I'd assume)
However, your question asks for hash functions that deal specifically with numbers, so here we go.
Hash functions that work over the integers
If you want to borrow already-existing algorithms, you may want to dabble in pseudo-random number generators
One simple one is the middle square method:
Take an n-digit number
Square it
Chop off the ends and keep the middle n digits (zero-padding the square as needed).
i.e.,
1111 => 01234321 => 2343
so, 1111 would be "hashed" to 2343 in the middle square method.
This method isn't that effective, but for a small number of hashes it has very low collision rates, a uniform distribution, and great chaos potential (small changes => big changes). But if you have many values, it's time to look for something else...
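In code, the middle square method is only a few lines (a sketch for fixed-width inputs):

def middle_square(x: int, digits: int = 4) -> int:
    s = str(x * x).zfill(2 * digits)   # square, zero-padded to twice the width
    mid = (len(s) - digits) // 2
    return int(s[mid:mid + digits])    # keep only the middle digits

middle_square(1111)  # 2343, as in the example above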
The grand-daddy of all feasibly efficient and simple random number generators is the Mersenne Twister (http://en.wikipedia.org/wiki/Mersenne_twister). In fact, an implementation is probably out there for every programming language imaginable. Your hash "input" is something that will be called a "seed" in their terminology.
In conclusion
Nothing wrong with string-based hash functions
If you want to stick with the integers and be fancy, try using your number as a seed for a pseudo-random number generator.
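A minimal sketch of the seed-a-PRNG idea (the packing scheme is my own; any way of folding the inputs into one seed works):

import random

def prng_hash(*ints) -> int:
    # Pack the inputs into a single seed integer, then take the
    # generator's first 32-bit output as the "hash".
    seed = 0
    for n in ints:
        seed = seed * (2**32) + (n & 0xFFFFFFFF)
    return random.Random(seed).getrandbits(32)

Python's random module is a Mersenne Twister internally, so this is exactly the "input as seed" approach described above.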
Hashing fits your requirements perfectly. If you really don't want to use strings, find a Hash library that will take numbers or binary data. But using strings here looks OK to me.
Bob Jenkins' mix function is a classic choice, at least when n = 3.
As others point out, hash functions do exactly what you want. Hashes take bytes - not character strings - and return bytes, and converting between integers and bytes is, of course, simple. Here's an example python function that works on 32 bit integers, and outputs a 32 bit integer:
import hashlib
import struct

def intsha1(ints):
    data = struct.pack('>%di' % len(ints), *ints)  # pack the ints as big-endian bytes
    digest = hashlib.sha1(data).digest()
    return struct.unpack('>i', digest[:4])[0]      # first 4 digest bytes back to an int
It can, of course, be easily adapted to work with different length inputs and outputs.
Have a look at chaotic systems; maybe they can inspire you.
In chaotic dynamics, small changes vary results greatly.
An x-bit block cipher will take a number and effectively convert it to another number. You could combine (sum/multiply?) your input numbers and cipher them, or iteratively encipher each number, similar to a CBC or chained mode. Google 'format preserving encryption'. It is possible to create a 32-bit block cipher (not widely available) and use it to produce a 'hashed' output. The main difference between a hash and encryption is that a hash is irreversible.

Symmetric Bijective String Algorithm?

I'm looking for an algorithm that can do a one-to-one mapping of a string onto another string.
I want an algorithm that given an alphabet I can perform a symmetric mapping function.
For example:
Let's consider that I have the alphabet "A","B","C","D","E","F". I want something like F("ABC") = "CEA" and F("CEA") = "ABC" for every N letter permutation.
Surely, an algorithm like this exists. If you know of an algorithm, please post the name of it and I can research it. If I haven't been clear enough in my request, please let me know.
Thanks in advance.
Edit 1:
I should clarify that I want enough entropy so that F("ABC") would equal "CEA" and F("CEA") = "ABC" but then I do NOT want F("ABD") to equal "CEF". Notice how two input letters stayed the same and the two corresponding output letters stayed the same?
So a Caesar Cipher/ROT13 or shuffling the array would not be sufficient. However, I don't need any "real" security. Just enough entropy for the output of the function to appear random. Weak encryption algorithms welcome.
Just create an array of objects that contain two fields: a letter and a random number. Sort the array by the random numbers. This creates a mapping where the i-th letter of the alphabet now maps to the i-th letter in the array.
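In Python, that trick is only a couple of lines (a sketch):

import random

def random_substitution(alphabet):
    # Pair each letter with a random number, sort by those numbers,
    # and map the i-th letter to the i-th letter of the shuffled result.
    shuffled = [ch for _, ch in sorted((random.random(), ch) for ch in alphabet)]
    return dict(zip(alphabet, shuffled))

random_substitution("ABCDEF")  # e.g. {'A': 'D', 'B': 'F', ...}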
If simple transposition or substitution isn't quite enough, it sounds like you want to advance to a polyalphabetic cipher. The Vigenère cipher is extremely easy to implement in code, but is still difficult to break without using a computer.
I suggest the following.
Perform a dense coding of the input to positive integers - with an alphabet size of n and string length of m you can code the string into integers between zero and n^m - 1. In your example this would be the range [0,215]. Now perform a fixed involution on the encoded number and decode it again.
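A minimal sketch with the question's six-letter alphabet, using the reflection x -> (n^m - 1) - x as the fixed involution (any involution on the range works; this particular one is too simple to satisfy the entropy requirement from the edit, but it shows the encode/involute/decode structure):

ALPHABET = "ABCDEF"

def encode(s):
    x = 0
    for ch in s:                       # dense base-n coding of the string
        x = x * len(ALPHABET) + ALPHABET.index(ch)
    return x

def decode(x, length):
    out = []
    for _ in range(length):
        x, r = divmod(x, len(ALPHABET))
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def F(s):
    top = len(ALPHABET) ** len(s) - 1  # n^m - 1, e.g. 215 for length 3
    return decode(top - encode(s), len(s))

assert F(F("ABC")) == "ABC"            # applying F twice returns the input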
Take RC4, settle for some password, and you're done. (Not that this would be very safe.)
Take the set of all permutations of your alphabet, shuffle it, and map the first half of the set onto the second half. Bad for large alphabets, of course. :)
Nah, thought that over, I forgot about character repetitions. Maybe divide the input into chunks without repeating chars and apply my suggestion to all of those chunks.
I would restate your problem thus, and give you a strategy for that restatement:
"A substitution cypher where a change in input leads to a larger change in output".
The blocking of characters is irrelevant-- in the end, it's just mappings between numbers. I'll speak of letters here, but you can extend it to any block of n characters.
One of the easiest routes for this is a rotating substitution based on input. Since you already looked at the Vigenere cipher, it should be easy to understand. Instead of making the key be static, have it be dependent on the previous letter. That is, rotate through substitutions a different amount per each input.
The variable rotation satisfies the condition of making each small change push out to a larger change. Note that the algorithm will only push changes in one direction such that changes towards the end have smaller effects. You could run the algorithm both ways (front-to-back, then back-to-front) so that every letter of cleartext changed has the possibility of changing the entire string.
The internal rotation strategy elides the need for keys, while of course losing most of the cryptographic security. It makes sense in context, though, as you are aiming for entropy rather than security.
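A minimal sketch of a single front-to-back pass, where each position's rotation is derived from the previous plaintext letter (the starting rotation and the "previous index + 1" rule are arbitrary choices of mine):

ALPHABET = "ABCDEF"
N = len(ALPHABET)

def forward(s):
    out, rot = [], 1                   # arbitrary fixed starting rotation
    for ch in s:
        i = ALPHABET.index(ch)
        out.append(ALPHABET[(i + rot) % N])
        rot = i + 1                    # next rotation depends on this plaintext letter
    return "".join(out)

def backward(s):
    out, rot = [], 1
    for ch in s:
        i = (ALPHABET.index(ch) - rot) % N
        out.append(ALPHABET[i])
        rot = i + 1                    # recovered plaintext drives the same rotations
    return "".join(out)

assert backward(forward("ABC")) == "ABC"

Running the pass twice, once front-to-back and once back-to-front as suggested above, lets a change anywhere in the cleartext affect the entire output string.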
You can solve this problem with format-preserving encryption.
One Java library can be found at https://github.com/EVGStudents/FPE.git. There you can define a regex and encrypt/decrypt string values matching that regex.
