How to calculate vectors with word2vec - gensim

Assume I have a number of pictures. Let’s say 10 pictures which are annotated by 50 people each.
So Pic 1 might be „beach, vacation, relax, sand, sun…“ I now trained word2vec with a domain specific content. I have the vectors of each word and can represent them. But what I want now, is to create ONE final vector representing each picture. So one vector with represents the 50 annotations (beach, vacation, relax, sand, sun…)
Let’s assume each vector is represented with 100 dimensions – do I just add the first dimension (the 100 dimensions) of all 50 vectors, than the 2nd dimension of all 50 vectors… etc.
I am very thankful for any comments that might help me!
I tried this, but I am not sure if this is the right way to do it.
I also tried doc2vec but I guess this is problematic as the word order of the annotations is irrelevant – but relevant for doc2vec.???

A few thoughts:
A list of annotations isn't quite like natural-language narrative text – in either the relative frequencies of tokens, or the importance of neighboring-tokens. So you may want to try out a extra-wide range of training parameters. For example, using a giant window (far larger than each of your texts) could essentially negate the (possibly-arbitrary) ordering of the annotations, putting every word in every other word's context. (That'd increase training time, but might help in other ways.) Also, look into the newly-tunable ns_exponent parameter - the paper referenced from the gensim docs suggests values very different from the default may help in certain recommendation contexts.
That said, the most-simple way to combine many vectors into one is to average them all together. Whether that works well for your purposes, you'd have to test. (It unavoidably loses the information of a larger set of independent vectors, but if other aspects of your modeling are strong enough – enough training data, enough dimensions – the really important shared aspects may be retained in the summary vector.)
(You can see some code for averaging word-vectors in another recent answer.)
You could also try Doc2Vec. It is no more ordering-dependent than Word2Vec – some modes use a window-sized context where neighboring words influence each other. (But other modes don't, or, as mentioned above, an oversized window can essentially make neighboring-distances less-relevant).

Related

How to interpret doc2vec classifier in terms of words?

I have trained a doc2vec (PV-DM) model in gensim on documents which fall into a few classes. I am working in a non-linguistic setting where both the number of documents and the number of unique words are small (~100 documents, ~100 words) for practical reasons. Each document has perhaps 10k tokens. My goal is to show that the doc2vec embeddings are more predictive of document class than simpler statistics and to explain which words (or perhaps word sequences, etc.) in each document are indicative of class.
I have good performance of a (cross-validated) classifier trained on the embeddings compared to one compared on the other statistic, but I am still unsure of how to connect the results of the classifier to any features of a given document. Is there a standard way to do this? My first inclination was to simply pass the co-learned word embeddings through the document classifier in order to see which words inhabited which classifier-partitioned regions of the embedding space. The document classes output on word embeddings are very consistent across cross validation splits, which is encouraging, although I don't know how to turn these effective labels into a statement to the effect of "Document X got label Y because of such and such properties of words A, B and C in the document".
Another idea is to look at similarities between word vectors and document vectors. The ordering of similar word vectors is pretty stable across random seeds and hyperparameters, but the output of this sort of labeling does not correspond at all to the output from the previous method.
Thanks for help in advance.
Edit: Here are some clarifying points. The tokens in the "documents" are ordered, and they are measured from a discrete-valued process whose states, I suspect, get their "meaning" from context in the sequence, much like words. There are only a handful of classes, usually between 3 and 5. The documents are given unique tags and the classes are not used for learning the embedding. The embeddings have rather dimension, always < 100, which are learned over many epochs, since I am only worried about overfitting when the classifier is learned, not the embeddings. For now, I'm using a multinomial logistic regressor for classification, but I'm not married to it. On that note, I've also tried using the normalized regressor coefficients as vector in the embedding space to which I can compare words, documents, etc.
That's a very small dataset (100 docs) and vocabulary (100 words) compared to much published work of Doc2Vec, which has usually used tens-of-thousands or millions of distinct documents.
That each doc is thousands of words and you're using PV-DM mode that mixes both doc-to-word and word-to-word contexts for training helps a bit. I'd still expect you might need to use a smaller-than-defualt dimensionaity (vector_size<<100), & more training epochs - but if it does seem to be working for you, great.
You don't mention how many classes you have, nor what classifier algorithm you're using, nor whether known classes are being mixed into the (often unsupervised) Doc2Vec training mode.
If you're only using known classes as the doc-tags, and your "a few" classes is, say, only 3, then to some extent you only have 3 unique "documents", which you're training on in fragments. Using only "a few" unique doctags might be prematurely hiding variety on the data that could be useful to a downstream classifier.
On the other hand, if you're giving each doc a unique ID - the original 'Paragraph Vectors' paper approach, and then you're feeding those to a downstream classifier, that can be OK alone, but may also benefit from adding the known-classes as extra tags, in addition to the per-doc IDs. (And perhaps if you have many classes, those may be OK as the only doc-tags. It can be worth comparing each approach.)
I haven't seen specific work on making Doc2Vec models explainable, other than the observation that when you are using a mode which co-trains both doc- and word- vectors, the doc-vectors & word-vectors have the same sort of useful similarities/neighborhoods/orientations as word-vectors alone tend to have.
You could simply try creating synthetic documents, or tampering with real documents' words via targeted removal/addition of candidate words, or blended mixes of documents with strong/correct classifier predictions, to see how much that changes either (a) their doc-vector, & the nearest other doc-vectors or class-vectors; or (b) the predictions/relative-confidences of any downstream classifier.
(A wishlist feature for Doc2Vec for a while has been to synthesize a pseudo-document from a doc-vector. See this issue for details, including a link to one partial implementation. While the mere ranked list of such words would be nonsense in natural language, it might give doc-vectors a certain "vividness".)
Whn you're not using real natural language, some useful things to keep in mind:
if your 'texts' are really unordered bags-of-tokens, then window may not really be an interesting parameter. Setting it to a very-large number can make sense (to essentially put all words in each others' windows), but may not be practical/appropriate given your large docs. Or, trying PV-DBOW instead - potentially even mixing known-classes & word-tokens in either tags or words.
the default ns_exponent=0.75 is inherited from word2vec & natural-language corpora, & at least one research paper (linked from the class documentation) suggests that for other applications, especially recommender systems, very different values may help.

Negative Training Image Examples for CNN

I am using the Caffe framework for CNN training. My aim is to perform simple object recognition for a few basic object categories. Since pretrained networks are not an alternative for my proposed usage I prepared an own training- and testset with about 1000 images for each of 2 classes (say chairs and cars).
The results are quite good. If I present an yet unseen image of a chair it is likely classified as such, same for a car image. My problem is that the results on miscellaneous images that do not show any of these classes often shows a very high confidence (=1) for one random class (which is not surprising regarding the onesided training data but a problem for my application). I thought about different solutions:
1) Adding a third class with also about 1000 negative examples that shows any objects except a chair and a car.
2) Adding more object categories in general, just to let the network classify other objects as such and not any more as a chair or car (of course this would require much effort). Maybe also the broader prediction results would show a more uniform distribution at negative images, allowing to evaluate the target objects presence based on a threshold?
Because it was not much time-consuming to grab random images as negative examples from the internet, I already tested my first solution with about 1200 negative examples. It helped, but the problem remains, perhaps because it were just too few? My concern is that if I increment the number of negative examples, the imbalance of the number of examples for each class leads to less accurate detection of the original classes.
After some research I found one person with a similar problem, but there was no solution:
Convolutional Neural Networks with Caffe and NEGATIVE IMAGES
My question is: Has anyone had the same problem and knows how to deal with it? What way would you recommend, adding more negative examples or more object categories or do you have any other recommendation?
The problem is not unique to Caffe or ConvNets. Any Machine Learning technique runs this risk. In the end, all classifiers take a vector in some input space (usually very high-dimensional), which means they partition that input space. You've given examples of two partitions, which helps to estimate the boundary between the two, but only that boundary. Both partitions have very, very large boundaries, precisely because the input space is so high-dimensional.
ConvNets do try to tackle the high-dimensionality of image data by having fairly small convolution kernels. Realistic negative data helps in training those, and the label wouldn't really matter. You could even use the input image as goal (i.e. train it as an autoencoder) when training the convolution kernels.
One general reason why you don't want to lump all counterexamples is because they may be too varied. If you have a class A with some feature value from the range [-1,+1] on some scale, with counterexamples B [-2,-1] and C [+1,+2], lumping B and C together creates a range [-2,+2] for counterexamples which overlaps the real real range. Given enough data and powerful enough classifiers, this is not fatal, but for instance an SVM can fail badly on this.

Remove noisy and redundant features

I have extracted features from a video sequence based on facial markers as means and standard deviations of those markers over a video sequence. They need to be classified into four different classes based on those markers.
In all I have a feature set of around 260 features. How should I determine which features are noisy and redundant in my set. I read about it in some research papers and some of them used the plus l take away r algorithm that I found to be quite appropriate but in such algorithms they always rate one feature against the other and say its good or bad compared to it.
How do I rate my features to be good or bad? What criterion are used for that generally?
I researched a lot for a couple of days but found nothing clear cut and useful. Would be grateful for the help, Thanks.
Think of your 260 features as a basis for a 260 dimensional room. However, your basis-vectors are not normal to each other so they contain a lot of redundant information. You'd like to transform these vectors into a vector-set where all vectors are normal to each other, thus minimizing the dimensions without losing (much) information.
This is what Principal component analysis does.
Linear discriminant analysis may also be of interest to you.
You can use pca or you can train some classifiers, and after this you loop all over yours features adding a big value to each feature, testing if this alteration changes the precision of the classifier, if not, you can remove this feature, after remove all the redundat features, and then retrain your classifiers!
Its a good ideia to train not one classifier but a lot of them, and them make your prediction based on votes, you can user MODE function in matlab to do this!
Use classification rate to determine a subset of feature how much good. You have 260 feature and then have 2^260 subset, this is too much! and search in this space is very difficult. Thus it's better to remove some feature by Filter method (for example FA, t-test, fisher and ...) and then use your search method to find best subset of feature.
Plus l take away r algorithm (or other search algorithm) find various subset and rate it (in this stage use classification rate) and at last specify which subset is better.

How to detect how similar a speech recording is to another speech recording?

I would like to build a program to detect how close a user's audio recording is to another recording in order to correct the user's pronunciation. For example:
I record myself saying "Good morning"
I let a foreign student record "Good morning"
Compare his recording to mine to see if his pronunciation was good enough.
I've seen this in some language learning tools (I believe Rosetta Stone does this), but how is it done? Note we're only dealing with speech (and not, say, music). What are some algorithms or libraries I should look into?
A lot of people seem to be suggesting some sort of edit distance, which IMO is a totally wrong approach for determining the similarity of two speech patterns, especially for patterns as short as OP is implying. The specific algorithms used by speech-recognition in fact are nearly the opposite of what you would like to use here. The problem in speech recognition is resolving many similar pronunciations to the same representation. The problem here is to take a number of slightly different pronunciations and get some kind of meaningful distance between them.
I've done quite a bit of this stuff for large scale data science, and while I can't comment on exactly how proprietary programs do it, I can comment on how it's done in academia and provide a solution that is straightforward and will give you the power and flexibility that you want for this approach.
Firstly: Assuming that what you have is some chunk of audio without any filtering done on it. Just as it would be acquired from a microphone. The first step is to eliminate background noise. There are a number of different methods for this, but I'm going to assume that what you want is something that will work well without being incredibly difficult to implement.
Filter the audio using scipy's filtering module here. There are a lot of frequencies that microphones pick up that are simply not useful for categorizing speech. I would suggest either a Bessel or a Butterworth filter to ensure that your waveform is persevered through filtering. The fundamental frequencies for everyday speech are generally between 800 and 2000 Hz (reference) so a reasonable cutoff would be something like 300 to 4000 Hz, just to make sure you don't lose anything.
Look for the least active portion of speech and assume that is a reasonable representation of background noise. At this point you're going to want to run a series of fourier transforms along your data (or generate a spectrogram) and find the part of your speech recording that has the lowest average frequency response. Once you have that snapshot, you should subtract it from all other points in your audio sample.
At this point should should have an audio file that is mostly just your user's speech and should be ready to be compared to another file that has gone through this process. Now, we want to actually clip the sound and compare this clip to some master clip.
Secondly: You're going to want to come up with a distance metric between two speech patterns, there are a number of ways to do this, but I'm going to assume we have the output of part one and some master file that has been through similar processing.
Generate a spectrogram of the audio file in question (example). The output from this is ultimately going to be an image that can be represented as a 2-d array of frequency response values. A spectrogram is essentially a fourier transform over time where the colour corresponds to intensity.
Use OpenCV (has python bindings, example) to run blob detection on your spectrogram. Effectively this is going to look for the big colorful blob in the middle of your spectrogram, and give you some limits on this. Effectively, what this should do, is return a significantly more sparse version of the original 2d-array that solely represents the speech in question. (With the assumption that your audio file will have some trailing stuff on the front and back ends of recording)
Normalize the two blobs to account for differences in speech speed. Everyone talks at a different speeds, and as such your blobs will probably have different sizes along the x-axis (time). This will ultimately introduce a level of checks in your algorithm that you don't want for the speed of speech. This step isn't needed if you also want to make sure that they speak at the same speed as the master copy, but I would suggest it. Basically you want to stretch out the shorter version by multiplying it's time axis by some constant that's just the ratio of the lengths of your two blobs.
You should also normalize the two blobs based on maximum and minimum intensity to account for people that talk at different volumes. Again, this is up to your discretion, but to fix this you should find similar ratios for the total span of intensities that you have as well as the two recording's max intensities and make sure that these two values match up between your 2-d arrays.
Third: Now that you have 2-d arrays representing your two speech events, that should in theory contain all of their useful information it's time to directly compare them. Luckily, comparing two matrices is a well-solved problem and there are a number of ways to move forward.
Personally I would suggest using a metric like Cosine Similarity to determine the difference between your two blobs, but that's not the only solution and while it'll give you a quick validation, you can do better.
You could try subtracting one matrix from the other and get an evaluation of how much difference there is between them, which would probably be more accurate than simple cosine distance.
It might be overkill, but you could assume that there are certain regions of speech that matter more or less for evaluating difference between blobs (it might not matter if someone uses a long i instead of a short i, but a g instead of a k could be a different word entirely). For something like that you'd want to develop a mask for the difference array in the previous step and multiply all your values by that.
Whichever method you choose, you can now simply set some difference threshold and make sure that the difference between the two blobs is below your desired threshold. If it is, the captured speech is similar enough to be correct. Otherwise have them try again.
I hope that's helpful, and again, I can't assure you that this is the exact algorithm that a company uses since that information is hugely proprietary and not open for the public, but I can assure you that methods similar to these are used in the very best papers in academia and that these methods will get you a great balance of accuracy and ease of implementation. Let me know if you have any questions, and good luck with your future data science exploits!
The musicg api https://code.google.com/p/musicg/
has a audio fingerprint generator and scorer
along with source code to show how its done.
I think it looks for the most similar point in each track, then scores based on how far it can match.
It might look something like
import com.musicg.wave.Wave
com.musicg.fingerprint.FingerprintSimilarity
com.musicg.fingerprint.FingerprintSimilarityComputer
com.musicg.fingerprint.FingerprintManager
double score =
new FingerprintsSimilarity(
new Wave("voice1.wav").getFingerprint(),
new Wave("voice2.wav").getFingerprint() ).getSimilarity();
Idea:
The way biotechnologists align two protein sequences is as follows: Each sequence is represented as a string on an alphabet as(A/C/G/T - these are different types of proteins, irrelevant for us), where each letter (here, an entry) represents a particular amino acid. The quality of an alignment (its score) is calculated from the similarity of each pair of corresponding entries, and the number and length of the blank entries that need to be inserted to produce that alignment.
Same algorithm (http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm) can be used for pronunciation, from substitution frequencies in a set of alternate pronunciations. Then you can calculate alignment scores to measure the similarity between the two pronunciations in a way that is sensitive to the differences between phonemes. Measures of similarity that can be used here are Levenshtein distance, phoneme error rate, and word error rate.
Algorithms
The minimum number of insertions, deletions and substitutions required for transformation of one sequence into another is the Levenshtein distance. More info at http://php.net/manual/en/function.levenshtein.php
Phoneme error rate (PER) is the Levenshtein distance between a predicted pronunciation and the reference pronunciation, divided by the number of phonemes in the reference pronunciation.
Word error rate (WER) is the proportion of predicted pronunciations with at least one phoneme error to the total number of pronunciations.
Source: Did an Internship on this at UW-Madison
A carefully configured Levenshtein distance should do the trick.
I know this question is out of date but...
To solve a similar problem I used Google Speech Recognized API to check WHAT was said and visual compare scaled wave forms of volume changes to detect differences in rhythm.
Code & video of the result.
you can use Musicg https://code.google.com/p/musicg/ as roy zhang suggested. In android, just include musicg jar file in your android project and use it. A tested example:
import com.musicg.wave.Wave;
import com.musicg.fingerprint.FingerprintSimilarity;
//somewhere in your code add
String file1 = Environment.getExternalStorageDirectory().getAbsolutePath();
file1 += "/test.wav";
String file2 = Environment.getExternalStorageDirectory().getAbsolutePath();
file2 += "/test.wav";
Wave w1 = new Wave(file1);
Wave w2 = new Wave(file2);
FingerprintSimilarity fps = w1.getFingerprintSimilarity(w2);
float score = fps.getScore();
float sim = fps.getSimilarity();
Log.d("score", score+"");
Log.d("similarities", sim+"");
Good luck
If this is only to check the pronunciation [of course with different accent], you can do this :
Step 1 : Using some voice tool [say dragon dictation], you can have the text with you.
Step 2 : Compare the string or the word formed and compare it with the string that actually was meant to be pronounced.
Step 3 : If you find any discrepancy in the strings, means the word was not spelled correctly. And you can suggest the correct pronunciation.
You have to look into speech recognition algorithms. I understand that you don't need to translate speech to text (that is done by speech recognition algorithms), however, in your case many algorithms would be the same.
Probably, HMM would be helpful here (hidden markov models).
Also look into here: http://htk.eng.cam.ac.uk/

Which data mining algorithm would you suggest for this particular scenario?

This is not a directly programming related question, but it's about selecting the right data mining algorithm.
I want to infer the age of people from their first names, from the region they live, and if they have an internet product or not. The idea behind it is that:
there are names that are old-fashioned or popular in a particular decade (celebrities, politicians etc.) (this may not hold in the USA, but in the country of interest that's true),
young people tend to live in highly populated regions whereas old people prefer countrysides, and
Internet is used more by young people than by old people.
I am not sure if those assumptions hold, but I want to test that. So what I have is 100K observations from our customer database with
approx. 500 different names (nominal input variable with too many classes)
20 different regions (nominal input variable)
Internet Yes/No (binary input variable)
91 distinct birthyears (numerical target variable with range: 1910-1992)
Because I have so many nominal inputs, I don't think regression is a good candidate. Because the target is numerical, I don't think decision tree is a good option either. Can anyone suggest me a method that is applicable for such a scenario?
I think you could design discrete variables that reflect the split you are trying to determine. It doesn't seem like you need a regression on their exact age.
One possibility is to cluster the ages, and then treat the clusters as discrete variables. Should this not be appropriate, another possibility is to divide the ages into bins of equal distribution.
One technique that could work very well for your purposes is, instead of clustering or partitioning the ages directly, cluster or partition the average age per name. That is to say, generate a list of all of the average ages, and work with this instead. (There may be some statistical problems in the classifier if you the discrete categories here are too fine-grained, though).
However, the best case is if you have a clear notion of what age range you consider appropriate for 'young' and 'old'. Then, use these directly.
New answer
I would try using regression, but in the manner that I specify. I would try binarizing each variable (if this is the correct term). The Internet variable is binary, but I would make it into two separate binary values. I will illustrate with an example because I feel it will be more illuminating. For my example, I will just use three names (Gertrude, Jennifer, and Mary) and the internet variable.
I have 4 women. Here are their data:
Gertrude, Internet, 57
Jennifer, Internet, 23
Gertrude, No Internet, 60
Mary, No Internet, 35
I would generate a matrix, A, like this (each row represents a respective woman in my list):
[[1,0,0,1,0],
[0,1,0,1,0],
[1,0,0,0,1],
[0,0,1,0,1]]
The first three columns represent the names and the latter two Internet/No Internet. Thus, the columns represent
[Gertrude, Jennifer, Mary, Internet, No Internet]
You can keep doing this with more names (500 columns for the names), and for the regions (20 columns for those). Then you will just be solving the standard linear algebra problem A*x=b where b for the above example is
b=[[57],
[23],
[60],
[35]]
You may be worried that A will now be a huge matrix, but it is a huge, extremely sparse matrix and thus can be stored very efficiently in a sparse matrix form. Each row has 3 1's in it and the rest are 0. You can then just solve this with a sparse matrix solver. You will want to do some sort of correlation test on the resulting predicting ages to see how effective it is.
You might check out the babynamewizard. It shows the changes in name frequency over time and should help convert your names to a numeric input. Also, you should be able to use population density from census.gov data to get a numeric value associated with your regions. I would suggest an additional flag regarding the availability of DSL access - many rural areas don't have DSL coverage. No coverage = less demand for internet services.
My first inclination would be to divide your response into two groups, those very likely to have used computers in school or work and those much less likely. The exposure to computer use at an age early in their career or schooling probably has some effect on their likelihood to use a computer later in their life. Then you might consider regressions on the groups separately. This should eliminate some of the natural correlation of your inputs.
I would use a classification algorithm that accepts nominal attributes and numeric class, like M5 (for trees or rules). Perhaps I would combine it with the bagging meta classifier to reduce variance. The original algorithm M5 was invented by R. Quinlan and Yong Wang made improvements.
The algorithm is implemented in R (library RWeka)
It also can be found in the open source machine learning software Weka
For more information see:
Ross J. Quinlan: Learning with Continuous Classes. In: 5th Australian Joint Conference on Artificial Intelligence, Singapore, 343-348, 1992.
Y. Wang, I. H. Witten: Induction of model trees for predicting continuous classes. In: Poster papers of the 9th European Conference on Machine Learning, 1997.
I think slightly different from you, I believe that trees are excellent algorithms to deal with nominal data because they can help you build a model that you can easily interpret and identify the influence of each one of these nominal variables and it's different values.
You can also use regression with dummy variables in order to represent the nominal attributes, this is also a good solution.
But you can also use other algorithms such as SVM(smo), with the previous transformation of the nominal variables to binary dummy ones, same as in regression.

Resources