I am using the Spark 1.6 cosine similarity (DIMSUM) algorithm.
Reference: https://github.com/eBay/Spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala
Here is what I am doing.
Input:
Text of 50k documents, with IDs, in a DataFrame.
Processing:
Tokenized the texts
Generated vectors using Word2Vec
Generated RowMatrix
Used columnSimilarities method with threshold (DIMSUM)
Output:
Got a coordinate matrix
On printing the entries of this coordinate matrix, I get output in this format, for example:
MatrixEntry(133,185,0.04106425850610451)
I do not understand what the numbers 133 and 185 are. My guess was that these are the document IDs/sequence numbers, but I am not sure. Can anyone please help?
Apologies if this question is very trivial.
MatrixEntry(i, j, value) represents the similarity between the i-th and j-th columns, so
MatrixEntry(133,185,0.04106425850610451)
is the similarity between the 133rd and the 185th column. These values correspond to columns of the matrix (features/terms), not to documents.
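For illustration, here is a minimal PySpark sketch of the same pipeline (the question targets Scala and Spark 1.6; this assumes a recent PySpark where RowMatrix.columnSimilarities is available, and the tiny docs_df DataFrame is made up), just to make the indexing concrete:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, Word2Vec
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("dimsum-sketch").getOrCreate()

# Hypothetical stand-in for the 50k-document DataFrame of (id, text).
docs_df = spark.createDataFrame(
    [(0, "spark computes cosine similarity"),
     (1, "dimsum samples column pairs")],
    ["id", "text"])

tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs_df)
w2v = Word2Vec(vectorSize=100, minCount=0, inputCol="words", outputCol="vec").fit(tokens)
vecs = w2v.transform(tokens)

# Each ROW of the RowMatrix is one document vector (100 dimensions here).
rows = vecs.rdd.map(lambda r: Vectors.dense(r["vec"].toArray()))
mat = RowMatrix(rows)

# DIMSUM compares COLUMNS, so MatrixEntry(i, j, v) pairs column i with
# column j -- i.e. the 100 vector dimensions, not the 50k documents.
sims = mat.columnSimilarities(0.1)
print(sims.entries.take(5))

To get document-to-document similarities, the document vectors would have to be the columns of the matrix (for example by transposing it), since columnSimilarities only ever compares columns.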
Source: https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/
This page has a section quoting the following passage:
Best Prepare Data for KNN
Rescale Data: KNN performs much better if all of the data has the same scale. Normalizing your data to the range [0, 1] is a good idea. It may also be a good idea to standardize your data if it has a Gaussian distribution.
Address Missing Data: Missing data will mean that the distance between samples cannot be calculated. These samples could be excluded or the missing values could be imputed.
Lower Dimensionality: KNN is suited for lower dimensional data. You can try it on high dimensional data (hundreds or thousands of input variables) but be aware that it may not perform as well as other techniques. KNN can benefit from feature selection that reduces the dimensionality of the input feature space.
Please, can someone explain the second point, i.e. Address Missing Data, in detail?
Missing data in this context means that some samples do not have values for all of the features.
For example:
Suppose you have a database with age and height for a group of individuals.
Missing data would mean that for some of these individuals either the height or the age is missing.
Now, why does this affect KNN?
Given a test sample, KNN finds the samples that are closest to it (i.e., the individuals with similar age and height).
KNN does this to make some inference about the test sample based on its nearest neighbors.
If you want to find these neighbors, you must be able to compute the distance between samples. To compute the distance between two samples, you must have all the features for both of them.
If some of them are missing, you won't be able to compute the distance.
So, implicitly, you would be losing the samples with missing data. You can either exclude those samples or impute (fill in) the missing values beforehand.
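Here is a minimal scikit-learn sketch of the two options from the quoted passage (exclude the incomplete samples, or impute the missing values so distances can be computed); the toy age/height numbers and column names are made up:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier

# Toy data: age and height with some missing entries.
df = pd.DataFrame({
    "age":    [15, 16, np.nan, 17, 15],
    "height": [160, 170, 168, np.nan, 158],
    "label":  [0, 1, 1, 0, 0],
})

# Option 1: exclude samples with missing features.
complete = df.dropna()

# Option 2: impute missing values (here: column means), so the distance
# to every sample can still be computed.
X_imputed = SimpleImputer(strategy="mean").fit_transform(df[["age", "height"]])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_imputed, df["label"])
print(knn.predict([[16, 165]]))

Without one of these two steps, any row containing a NaN leaves the distance to that row undefined, which is exactly the problem described above.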
I realize this is an unspecific question (because I don't know a lot about the topic; please help me in this regard). That said, here's the task I'd like to achieve:
Find a statistically sound algorithm to determine an optimal cut-off value for binarizing a vector, i.e. for filtering out (getting rid of) minimal values. Here's MATLAB code to visualize this problem:
randomdata = rand(1,100);                    % random data between 0 and 1
figure; plot(randomdata);                    % plot the random data
cutoff = 0.5;                                % candidate cut-off value
line(get(gca,'xlim'), [cutoff cutoff], 'Color', 'red');   % draw the cut-off line
Thanks
You could try using MATLAB's percentile function, prctile:
cutoff = prctile(randomdata,10);
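For reference, the same percentile idea in Python/NumPy, including the binarization step from the question (the 10th percentile is just an example threshold):

import numpy as np

rng = np.random.default_rng(0)
randomdata = rng.random(100)            # analogous to rand(1,100) above

cutoff = np.percentile(randomdata, 10)  # same idea as prctile(randomdata, 10)
binarized = (randomdata >= cutoff).astype(int)  # 0 for roughly the lowest 10% of values, 1 otherwise
print(cutoff, binarized.sum())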
In gensim's documentation, the window size is defined as:
window is the maximum distance between the current and predicted word within a sentence.
which should mean that, when looking at context, it doesn't go beyond the sentence boundary, right?
What I did was create a document with several thousand tweets, select a word (q1), and then select the words most similar to q1 (using model.most_similar('q1')). But if I randomly shuffle the tweets in the input document and then run the same experiment (without changing the word2vec parameters), I get a different set of most_similar words for q1.
I can't really understand why that happens if all it looks at is sentence-level information. Can anyone explain this?
EDIT: added model parameters and a graph
Model parameters used:
model1 = word2vec.Word2Vec(sents1, size=100, window=5, min_count=5, iter=n_iter, sg=0)
Graph:
To draw the graph, I ran word2vec with the above parameters on the original document (D) and on the shuffled document (D'), took the top 10 or 20 (the two bars) most_similar('q') words for a specific query word q, and calculated the Jaccard similarity score between the two sets of words for iter = 1, 10, 100.
It seems that as the number of iterations increases, there are fewer and fewer words in common between the two sets obtained from running word2vec on D and D'.
I can't really understand why this is happening or what's going on.
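For reference, here is a rough sketch of the experiment described above (train on the original and on the shuffled corpus, then compare the top-k most_similar sets with Jaccard similarity). It uses gensim 4 parameter names (vector_size/epochs instead of size/iter); sents1 is the tokenized tweet corpus from the question and 'q1' the query word:

import random
from gensim.models import Word2Vec

def top_k(sentences, query, k=10, n_iter=1, seed=1):
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=5,
                     epochs=n_iter, sg=0, seed=seed, workers=1)
    return {w for w, _ in model.wv.most_similar(query, topn=k)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

# sents1: list of tokenized tweets, e.g. [["q1", "is", "great"], ...]
sents_shuffled = list(sents1)
random.shuffle(sents_shuffled)

for n_iter in (1, 10, 100):
    s1 = top_k(sents1, "q1", n_iter=n_iter)
    s2 = top_k(sents_shuffled, "q1", n_iter=n_iter)
    print(n_iter, jaccard(s1, s2))

Using workers=1 and a fixed seed makes each individual training run reproducible, which helps isolate the effect of the corpus order from other sources of randomness.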
I have a list of many users (over 10 million), each of which is represented by a userid followed by 10 floating-point numbers indicating their preferences. I would like to efficiently calculate the user similarity matrix using cosine similarity, based on MapReduce. However, since the values are floating-point numbers, it is hard to determine a key in the MapReduce framework. Any suggestions?
I think the easiest solution would be the Mahout library. There are a couple of map-reduce similarity matrix jobs in Mahout that might work for your use case.
The first is Mahout's ItemSimilarityJob, which is part of its recommender system libraries. The specific info for that job can be found here. You would simply need to provide the input data in the required format and choose your VectorSimilarityMeasure (which for your case would be SIMILARITY_COSINE), along with any additional optimizations. Since you are looking to calculate user-user similarity based on a preference vector of ten floating-point values, what you could do is assign a simple 1-to-10 numeric index to the positions of the vector and generate a simple .csv file of vectorIndex, userID, decimalValue as input for the Mahout item-similarity job (the userID being a numeric Int or Long value). The resulting output should be a tab-separated text file of userID, userID, similarity.
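As a rough illustration of that input format, here is a short Python sketch (the file names and the raw "userid v1 ... v10" layout of the source data are assumptions) that writes the vectorIndex,userID,decimalValue rows described above:

# Convert "userid v1 v2 ... v10" lines into "vectorIndex,userID,decimalValue"
# triples, with vectorIndex running from 1 to 10.
with open("user_prefs.txt") as src, open("mahout_input.csv", "w") as dst:
    for line in src:
        parts = line.split()
        user_id, values = parts[0], parts[1:]
        for idx, value in enumerate(values, start=1):
            dst.write("{},{},{}\n".format(idx, user_id, value))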
A second solution might be Mahout's RowSimilarityJob, included in its math library. I've never used it myself, but some info can be found here and in this previous Stack Overflow thread. Rather than a .csv as input, you would need to translate your input data into a DistributedRowMatrix, with the userIDs being the rows of the matrix. The output, I believe, will also be a DistributedRowMatrix sequence file containing the user-user similarity data you are seeking.
I suppose which solution is better depends on what input/output format you prefer. All the best.
I am quite new to R. I am trying to do a correspondence analysis (corresp from the MASS package) on summarized data. While the output shows row and column scores, the resulting biplot shows the column scores as zero, making the plot unreadable (all values are arranged by row score in the expected manner, but flat along the column scores).
The code is:
corresp(some_data)
biplot(corresp(some_data, nf = 2))
I would be grateful for any suggestions as to what I'm doing wrong and how to amend this, thanks in advance!
Martin
(Links to images: the plot and the corresp results.)
As suggested here:
http://www.statsoft.com/textbook/correspondence-analysis
the biplot actually depicts the distributions of the row/column variables over the 2 extracted dimensions where the variables' dependency is "the sharpest".
It looks like in your case a good deal of the dependency is concentrated along just one dimension, while the second dimension is already much less significant.
It does not seem, however, that your relationships are weak. On the contrary, looking at your graph, one can observe the red (column) variable's intersection with 2 distinct regions of the other variable's values.
Makes sense?
Regards,
Igor