Do I need to create separate embedding matrices for source and target vocab for abstractive summarization model? - stanford-nlp

I'm working on a Seq2Seq model to perform abstractive summarization using the GloVe pre-trained word embeddings. Is it required that I make two embedding matrices: one that covers the source vocabulary and one that covers the summary vocabulary?

No, the common practice is to share the embedding matrix, even in machine translation where the words come from different languages.
Sometimes, the embedding matrix is also used as an output projection matrix when generating model output (see, e.g., the "Attention Is All You Need" paper). However, this is only possible if you use a vocabulary of tens of thousands of (sub)words, as opposed to the very large vocabulary of GloVe.
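For illustration, here is a minimal PyTorch-style sketch of that kind of sharing. It is not a full Seq2Seq implementation; the vocabulary size, dimensions, and the `glove_matrix` array are assumptions. One embedding matrix serves both encoder and decoder inputs, and the same weights are reused as the output projection ("weight tying"):

```python
import torch
import torch.nn as nn

# One shared vocabulary over source texts and summaries (size is illustrative).
vocab_size, emb_dim = 50000, 300

shared_embedding = nn.Embedding(vocab_size, emb_dim)
# Optionally initialize from GloVe, assuming `glove_matrix` is a hypothetical
# (vocab_size, emb_dim) numpy array aligned with your vocabulary:
# shared_embedding.weight.data.copy_(torch.from_numpy(glove_matrix))

encoder = nn.LSTM(emb_dim, 256, batch_first=True)
# The decoder's hidden size must equal emb_dim for the tying below to work.
decoder = nn.LSTM(emb_dim, emb_dim, batch_first=True)

# Reuse the embedding matrix as the output projection (weight tying).
output_projection = nn.Linear(emb_dim, vocab_size, bias=False)
output_projection.weight = shared_embedding.weight
```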

Related

How to interpret doc2vec classifier in terms of words?

I have trained a doc2vec (PV-DM) model in gensim on documents which fall into a few classes. I am working in a non-linguistic setting where both the number of documents and the number of unique words are small (~100 documents, ~100 words) for practical reasons. Each document has perhaps 10k tokens. My goal is to show that the doc2vec embeddings are more predictive of document class than simpler statistics and to explain which words (or perhaps word sequences, etc.) in each document are indicative of class.
I have good performance from a (cross-validated) classifier trained on the embeddings compared to one trained on the other statistics, but I am still unsure how to connect the results of the classifier to any features of a given document. Is there a standard way to do this? My first inclination was to simply pass the co-learned word embeddings through the document classifier in order to see which words inhabited which classifier-partitioned regions of the embedding space. The class labels this produces for the word embeddings are very consistent across cross-validation splits, which is encouraging, although I don't know how to turn these effective labels into a statement to the effect of "Document X got label Y because of such-and-such properties of words A, B and C in the document".
Another idea is to look at similarities between word vectors and document vectors. The ordering of similar word vectors is pretty stable across random seeds and hyperparameters, but the output of this sort of labeling does not correspond at all to the output from the previous method.
Thanks for help in advance.
Edit: Here are some clarifying points. The tokens in the "documents" are ordered, and they are measured from a discrete-valued process whose states, I suspect, get their "meaning" from context in the sequence, much like words. There are only a handful of classes, usually between 3 and 5. The documents are given unique tags and the classes are not used for learning the embedding. The embeddings have rather low dimension, always < 100, and are learned over many epochs, since I am only worried about overfitting when the classifier is learned, not the embeddings. For now, I'm using a multinomial logistic regressor for classification, but I'm not married to it. On that note, I've also tried using the normalized regressor coefficients as a vector in the embedding space to which I can compare words, documents, etc.
That's a very small dataset (100 docs) and vocabulary (100 words) compared to much published Doc2Vec work, which has usually used tens of thousands to millions of distinct documents.
That each doc is thousands of words, and that you're using the PV-DM mode which mixes both doc-to-word and word-to-word contexts for training, helps a bit. I'd still expect you might need a smaller-than-default dimensionality (vector_size << 100) & more training epochs - but if it does seem to be working for you, great.
You don't mention how many classes you have, nor what classifier algorithm you're using, nor whether known classes are being mixed into the (often unsupervised) Doc2Vec training mode.
If you're only using known classes as the doc-tags, and your "a few" classes is, say, only 3, then to some extent you only have 3 unique "documents", which you're training on in fragments. Using only "a few" unique doctags might be prematurely hiding variety in the data that could be useful to a downstream classifier.
On the other hand, if you're giving each doc a unique ID - the original 'Paragraph Vectors' paper's approach - and then feeding those to a downstream classifier, that can be OK alone, but may also benefit from adding the known classes as extra tags, in addition to the per-doc IDs. (And perhaps, if you have many classes, those may be OK as the only doc-tags. It can be worth comparing each approach.)
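As a sketch of that "unique ID plus class tag" setup in gensim (the corpus contents and parameter values below are invented placeholders):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical corpus: each item is (list_of_tokens, class_label).
corpus = [(["state3", "state1", "state4"], "classA"),
          (["state2", "state2", "state5"], "classB")]

# Give every document a unique ID tag *and* its known class as an extra tag.
tagged = [TaggedDocument(words=tokens, tags=[f"doc_{i}", label])
          for i, (tokens, label) in enumerate(corpus)]

model = Doc2Vec(tagged, vector_size=32, window=5, epochs=100, min_count=1)

# Both per-document and per-class vectors are now available:
doc_vec = model.dv["doc_0"]
class_vec = model.dv["classA"]
```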
I haven't seen specific work on making Doc2Vec models explainable, other than the observation that when you are using a mode which co-trains both doc- and word- vectors, the doc-vectors & word-vectors have the same sort of useful similarities/neighborhoods/orientations as word-vectors alone tend to have.
You could simply try creating synthetic documents, or tampering with real documents' words via targeted removal/addition of candidate words, or blended mixes of documents with strong/correct classifier predictions, to see how much that changes either (a) their doc-vector, & the nearest other doc-vectors or class-vectors; or (b) the predictions/relative-confidences of any downstream classifier.
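One cheap way to run such a probe, sketched below under the assumption that `model` is a trained gensim Doc2Vec, is to re-infer a document's vector after ablating a candidate word and measure how far the vector moves. (Note infer_vector is stochastic, so averaging over several runs is advisable.)

```python
import numpy as np

def probe_word(model, tokens, word):
    """Cosine distance between doc-vectors with & without `word` (illustrative)."""
    base = model.infer_vector(tokens, epochs=50)
    ablated = model.infer_vector([t for t in tokens if t != word], epochs=50)
    cos = np.dot(base, ablated) / (np.linalg.norm(base) * np.linalg.norm(ablated))
    return 1.0 - cos  # larger = the word mattered more to the doc-vector
```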
(A wishlist feature for Doc2Vec for a while has been to synthesize a pseudo-document from a doc-vector. See this issue for details, including a link to one partial implementation. While the mere ranked list of such words would be nonsense in natural language, it might give doc-vectors a certain "vividness".)
When you're not using real natural language, some useful things to keep in mind:
- if your 'texts' are really unordered bags-of-tokens, then window may not really be an interesting parameter. Setting it to a very large number can make sense (to essentially put all words in each other's windows), but may not be practical/appropriate given your large docs. Or, try PV-DBOW instead - potentially even mixing known classes & word-tokens in either tags or words.
- the default ns_exponent=0.75 is inherited from word2vec & natural-language corpora, & at least one research paper (linked from the class documentation) suggests that for other applications, especially recommender systems, very different values may help.
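Putting the points above together, a starting configuration might look like the sketch below - every value is an assumption to tune against your own data, not a recommendation:

```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(
    dm=0,             # PV-DBOW; with dm=0 and dbow_words=0, window is effectively unused
    vector_size=32,   # well below the default of 100, given the tiny vocabulary
    ns_exponent=0.5,  # deviating from the word2vec-inherited default of 0.75
    epochs=200,
    min_count=1,
)
# Alternatively, keep PV-DM but make every token share one context:
# model = Doc2Vec(dm=1, window=10_000, vector_size=32, epochs=200, min_count=1)
```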

How to measure similarity between words or very short text

I work on the problem of finding the nearest document in a list of documents. Each document is a word or a very short sentence (e.g. "jeans" or "machine tool" or "biological tomatoes"). By closest I mean close in a semantic way.
I have tried using word2vec embeddings (from the Mikolov article), but the closest words are more contextually linked than semantically linked ("jeans" is linked to "shoes" and not to "trousers" as expected).
I have tried to use Bert encoding (https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/#32-understanding-the-output) using last layers but it faces the same issues.
I have tried Elasticsearch, but it doesn't find semantic similarities.
(The task needs to be solved in French but maybe solving it in English is a good first step)
Note that different sets of word-vectors may vary in how well they capture your desired 'semantic' similarities. (In particular, training with a shorter window may emphasize similarity among words that are drop-in replacements for each other, as opposed to words merely used in similar domains, which larger window values may emphasize. See this answer for more details.)
You may also want to take a look at "Word Mover's Distance" as a way to compare short texts that contain various mixes of somewhat-similar words. (It's fairly expensive, but should be practical on your short texts. It's available in the Python gensim library as wmdistance() on KeyedVectors instances.)
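A minimal sketch of WMD in gensim follows. The vector set is an illustrative English choice (a French KeyedVectors model would be used the same way), and wmdistance needs gensim's optional optimal-transport dependency installed:

```python
import gensim.downloader as api

# Any KeyedVectors will do; this pre-trained English set is just an example.
kv = api.load("glove-wiki-gigaword-100")

# Word Mover's Distance between two short, tokenized texts:
d_related = kv.wmdistance(["machine", "tool"], ["milling", "machine"])
d_unrelated = kv.wmdistance(["machine", "tool"], ["biological", "tomatoes"])
print(d_related < d_unrelated)  # expect the related pair to be closer
```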
If you have training data where your specific multi-word phrases are used, in many natural-language-like subtly-varied contexts, you could consider combining all such phrases-of-interest into single tokens (like machine_tool or biological_tomatoes), and training your own domain-specific word-vectors.
For computing similarity between short texts which contain 2 or 3 words, you can use word2vec and take the average vector of the sentence.
For example, if you have a text ("machine tool") and want to represent it as one vector using word2vec, you get the vector of "machine" and the vector of "tool", then combine them into one vector by taking the average, which is to add the two vectors and divide by 2 (the number of words). This gives you a vector representation for a sentence of more than one word.
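A small sketch of that averaging, assuming `kv` is a loaded gensim KeyedVectors object and the token lists are illustrative:

```python
import numpy as np

def average_vector(kv, tokens):
    """Mean of the word-vectors for in-vocabulary tokens; None if none are known."""
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else None

# e.g. compare two short texts via the cosine of their average vectors:
v1 = average_vector(kv, ["machine", "tool"])
v2 = average_vector(kv, ["milling", "machine"])
```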
You can also use something like doc2vec, which is built on top of word2vec and whose purpose is to get a vector for a sentence or paragraph.
You might try document embeddings that are built on top of word2vec.
However, notice that word and document embeddings do not always capture the "desired similarity"; they just learn a language model on your corpus, and they are heavily influenced by text size and word frequency.
How big is your corpus? If you need it just to perform some classification, it might be better to use vectors trained on a large dataset such as the Google News corpus.

Is there a way to load pre-trained word vectors before training the doc2vec model?

I am trying to build a doc2vec model with roughly 10k sentences; after that I will use the model to find the most similar sentence in the model for some new sentences.
I have trained a gensim doc2vec model using the corpus (10k sentences) I have. This model can, to some extent, tell me if a new sentence is similar to some of the sentences in the corpus.
But, there is a problem: it may happen that there are words in new sentences which don't exist in the corpus, which means that they don't have a word embedding. If this happens, the prediction result will not be good.
As far as I know, the trained doc2vec model does have a matrix of doc vectors as well as a matrix of word vectors. So what I was thinking is to load a set of pre-trained word vectors, which contains a large number of words, and then train the model to get the doc vectors. Does this make sense? Is it possible with gensim? Or is there another way to do it?
Unlike what you might guess, typical Doc2Vec training does not train up word-vectors first, then compose doc-vectors using those word-vectors. Rather, in the modes that use word-vectors, the word-vectors are trained in a simultaneous, interleaved fashion alongside the doc-vectors, both changing together. And in one fast and well-performing mode, PV-DBOW (dm=0 in gensim), word-vectors aren't trained or used at all.
So, gensim Doc2Vec doesn't support pre-loading state from elsewhere, and even if it did, it probably wouldn't provide the benefit you expect. (You could dig through the source code & perhaps force it by doing a bunch of initialization steps yourself. But then, if words were in the pre-loaded set but not in your training data, training the rest of the active words would adjust the entire model in directions incompatible with the imported-but-untrained 'foreign' words. It's only the interleaved, tug-of-war co-training of the model's state which makes the various vectors meaningful in relation to each other.)
The most straightforward and reliable strategy would be to try to expand your training corpus, by finding more documents from a similar/compatible domain, to include multiple varied examples of any words you might encounter later. (If you thought some other word-vectors were apt enough for your domain, perhaps the texts that were used to train those word-vectors can be mixed-into your training corpus. That's a reasonable way to put the word/document data from that other source on equal footing in your model.)
And, as new documents arrive, you can also occasionally re-train the model from scratch, with the now-expanded corpus, letting newer documents contribute equally to the model's vocabulary and modeling strength.
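A sketch of that re-training workflow in gensim, where `corpus_tokens` and `new_tokens` are hypothetical lists of tokenized sentences and all parameter values are placeholders:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def build_model(token_lists):
    """Train from scratch on the full, expanded corpus (settings are illustrative)."""
    tagged = [TaggedDocument(words=toks, tags=[str(i)])
              for i, toks in enumerate(token_lists)]
    return Doc2Vec(tagged, vector_size=100, epochs=40, min_count=2)

model = build_model(corpus_tokens)               # the original ~10k tokenized sentences
# ...later, once enough new documents have accumulated:
model = build_model(corpus_tokens + new_tokens)  # re-train from scratch

# Between re-trainings, new sentences can still be compared via inference:
vec = model.infer_vector(["some", "new", "sentence"])
similar = model.dv.most_similar([vec], topn=5)
```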

CAD mechanical bonds and kinematics

Is there a CAD format which allows one to describe mechanical bonds between objects (axes, joints, springs, etc.) and/or the motion of objects?
I don't have any personal CAD experience, but 3D objects can be defined using edges and nodes/vertices, which reduces the problem to a graph-representation issue. 3D models defined using other techniques (such as composed functions or a filter pipeline, e.g. B-spline > Lathe or extrude > Transformation > Material/texture) will have considerably different means of serialization to file.
Storing motion is a separate issue entirely - and also complicated by the many different ways to represent animation, and by whether your system uses any kind of skeletal basis or solid-body physics/animation. Then there's keyframed and parameterised animation. Are you using an inverse-kinematics system too?
You need to be more specific about why you need such a CAD format.
Without really knowing what you are after, I think the closest thing to what you are asking for is finite element analysis programs like ANSYS, ABAQUS, etc. They work in a CAD-like environment and allow adding many types of elements to simulate dynamic (motion) analyses and many other types of analysis. They include a huge variety of elements, like linear springs, rotational springs, etc.

Artificial neural network image transformation

I have pairs of images (input-output), but I don't know the transformation for going from A (input) to B (output). I want to record image A and get image B. Physically I can change the setup to get A or B, but I want to do it in software.
If I understood well, a trained Artificial Neural Network is able to do that: given an input, it can produce the corresponding output. Is that right?
Is there any software/ANN such that, just by "training" it with a number of input-output pairs, it will be able to provide the correct output when the input is a new (but similar to the others) image?
Thanks
If you have some relevant amount of image pairs (input/output pairs) and you don't know the transformation between input and output, you could train an ANN on that training set to imitate the unknown transformation. You will only be able to train your ANN well if you have a sufficient amount of training image pairs, and it could be practically impossible if the unknown transformation is complicated.
For example, if the transformation simply increases the intensity values of the pixels in the input image by a given amount, an ANN will learn to imitate that behavior very quickly; but if the unknown transformation is some complicated convolution, or a few serial convolutions, or something more complicated still, it will be very hard, near impossible, to train an ANN to imitate it. So, a more complex transformation will need a bigger training set and a more complex ANN design.
There are plenty of free, open-source ANN libraries implemented in many languages. You could start, for example, with this tutorial: http://www.codeproject.com/Articles/13091/Artificial-Neural-Networks-made-easy-with-the-FANN
What you are asking is possible in principle -- in theory, an ANN with sufficiently many hidden units can learn an arbitrary function to map inputs to outputs. However, as the comments and other answers have mentioned, there may be many technical issues with your particular problem that could make it impractical. I would classify these problems as (a) mapping complexity, (b) model complexity, (c) scaling complexity, and (d) implementation complexity. They are all somewhat related, but hopefully this is a useful way to break things down.
Mapping complexity
As mentioned by Springfield762, there are many possible functions that map from one image to another image. If the relationship between your input images and your output images is relatively simple -- like increasing the intensity of each pixel by a constant amount -- then an ANN would be able to learn this mapping without much difficulty. There are probably many more transformations that would be similarly easy to learn, such as skewing, flipping, rotating, or translating an image -- basically any affine transformation would be easy to learn. Other, nonlinear transformations could also be feasible, such as squaring the intensity of each pixel.
As a general rule, the more complicated the relationship between your input and output images, the more difficult it will be to get a model to learn this mapping for you.
Model complexity
The more complex the mapping from inputs to outputs, the more complex your ANN model will need to be to capture this mapping. Models with many hidden layers have been shown in the past 10 years to perform quite well on tasks that people had previously thought impossible, but often these state-of-the-art models have millions or even billions of parameters and take weeks to train on GPU hardware. A simple model can capture many simple mappings, but if you have a complex input-output map to learn, you'll need a large, complex model.
Scaling complexity
Yves mentioned in the comments that it can be difficult to scale models up to typical image sizes. If your images are relatively small (currently the state of the art is to model images on the order of 100x100 pixels), then you can probably just throw a bunch of raw pixel data at an ANN model and see what happens. But if you're using 6000x4000 images from your shiny Nikon DSLR, it's going to be quite difficult to process those in a reasonable amount of time. You'd be better off compressing your image data somehow (PCA is a common technique) and then trying to learn the mapping in the compressed space.
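A sketch of that compress-then-learn idea using scikit-learn's PCA; the arrays here are random stand-ins for real flattened image pairs, and the component count is an assumption:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-ins for real data: 500 paired 64x64 images, flattened.
rng = np.random.default_rng(0)
X_a = rng.random((500, 64 * 64))  # inputs (image A)
X_b = rng.random((500, 64 * 64))  # targets (image B)

pca_a = PCA(n_components=100).fit(X_a)
pca_b = PCA(n_components=100).fit(X_b)

Z_a = pca_a.transform(X_a)  # compressed inputs
Z_b = pca_b.transform(X_b)  # compressed targets: learn the mapping Z_a -> Z_b

# After a model predicts Z_b_hat for a new image, reconstruct pixels with:
# B_hat = pca_b.inverse_transform(Z_b_hat)
```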
In addition, larger images will have a larger space of possible mappings between them, so you'll need more of your larger images as training data than you would if you had small images.
Springfield762 also mentioned this: If the mapping between your input and output images is simple, then you'll only need a few examples to learn the mapping successfully. But if you have a complicated mapping, then you'll need much more training data to have a chance at learning the mapping properly.
Implementation complexity
It's unlikely that a tool already exists that would let you just throw image data into an ANN model and have a mapping appear. Most likely you'll need, at a minimum, to implement some code that will pre-process your image data. In addition, if you have lots of large images you'll probably need to write code to handle loading data from disk, etc. (There are a lot of "big data" tools for things like this, but they all require some amount of work to get set up.)
There are many, many open source ANN toolkits out there nowadays. FANN (already mentioned) is a popular one in C++ with bindings in other languages. Caffe is quite popular, and is also implemented in C++ with bindings. There seem to be many toolkits that use Python and Theano or some other GPU acceleration library -- Keras, Lasagne, Hebel, Pylearn2, neon, and Theanets (I wrote this one). Many people use Torch, written in Lua. Matlab has at least one neural network toolbox. I'm less familiar with other ecosystems, but Java seems to have Deeplearning4j, C# has Accord, and even R has darch.
But with any of these neural network toolkits, you're going to have to write some code to load the data, process it into the appropriate input format, construct (or load) a network model, train the model, etc.
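To make that concrete, here is a minimal, hedged sketch using Keras, one of the toolkits mentioned above. Synthetic 32x32 pairs stand in for real data, and the simple brightness "transformation" is invented purely for the demo:

```python
import numpy as np
from tensorflow import keras

# Synthetic stand-in data: 200 flattened 32x32 grayscale image pairs.
X = np.random.rand(200, 32 * 32).astype("float32")
Y = np.clip(X * 1.5, 0.0, 1.0)  # an invented "unknown" transformation

model = keras.Sequential([
    keras.Input(shape=(32 * 32,)),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(32 * 32, activation="sigmoid"),  # one output per pixel
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, Y, epochs=20, batch_size=32, validation_split=0.2)

B_hat = model.predict(X[:1])  # predicted output image for one new input
```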
The problem you're trying to solve is a canonical classification problem that neural networks can help you solve. You treat the B images as a set of labels that you match to A, and once trained, the neural network will be able to match the B images to new input based on where the network locates new input in a high-dimensional vector space. I assume you'd use some combination of convolutional networks to create your features, and softmax for multinomial classification on the output layer. More here: http://deeplearning4j.org/convolutionalnets.html
Since this was written, there has been a lot of work in the realm of cGANs (conditional generative adversarial networks); please refer to the "Image-to-Image Translation with Conditional Adversarial Networks" (pix2pix) paper:
https://arxiv.org/pdf/1611.07004.pdf
