As far as I know, doc2vec computes embeddings for both documents and words. Can I use a word vector and a document vector to estimate the similarity of a word to a document, or can I only compare documents against documents and words against words? Any remarks would be helpful.
Related
We have tested different analyzers on the search index, but none of them gives a natural sorting order. All the analyzers except the keyword analyzer tokenize the input string before the sorting algorithm is applied. With the keyword analyzer we get ASCII sorting, but there is an issue: all uppercase strings are placed at the start of the list and all lowercase strings at the end. This is because, in ASCII, lowercase letters have larger code values than uppercase letters.
Example:
Input Strings: steve, John, Dave, george
After sorting: Dave, John, george, steve
Expected Output: Dave, george, John, steve
Is there a way in Cloudant to achieve a natural/alphabetical sorting order irrespective of case?
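The two orderings in the example can be reproduced outside Cloudant. The following is a minimal Python sketch, meant only to illustrate the difference between code-point (ASCII) order and case-insensitive order; it does not use or demonstrate any Cloudant API. The general idea is to compare on a case-folded key while keeping the original strings.

```python
names = ["steve", "John", "Dave", "george"]

# Code-point (ASCII) order: every uppercase letter sorts before every lowercase one.
ascii_sorted = sorted(names)

# Case-insensitive order: compare on a case-folded key, keep the original strings.
case_insensitive_sorted = sorted(names, key=str.casefold)

print(ascii_sorted)             # ['Dave', 'John', 'george', 'steve']
print(case_insensitive_sorted)  # ['Dave', 'george', 'John', 'steve']
```

Whether an equivalent lowercased sort key can be set up in a given Cloudant search index is something to verify against the Cloudant documentation.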
Has anybody tried to customize the BM25 similarity used in Elasticsearch in the following way?
This is the common BM25 score. I want term frequencies to be binary (0 if a term is not present in the document and 1 if its term frequency in the document is greater than 0). So in the pic below I want tf(q_i, d) to be in {0, 1}.
Any ideas on the easiest way to achieve this in Elasticsearch?
One way to achieve this is to use the Unique Token Filter which will index only unique tokens during analysis.
This should be equivalent to having a term frequency of 1 in the document if a token exists.
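As a rough sketch of this suggestion, the index settings below define a custom analyzer that ends with the built-in `unique` token filter. The analyzer name `binary_tf`, the field name `body`, and the choice of the `standard` tokenizer with `lowercase` are assumptions made for illustration. The small helper `unique_tokens` is hypothetical and only mimics the effect of duplicate removal on a token stream; it is not Elasticsearch code.

```python
# Request body for creating an index (shape only; names are illustrative assumptions).
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "binary_tf": {  # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "standard",
                    # "unique" drops repeated tokens, so each term is indexed once per field value
                    "filter": ["lowercase", "unique"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "body": {"type": "text", "analyzer": "binary_tf"}  # hypothetical field
        }
    },
}


def unique_tokens(tokens):
    """Keep the first occurrence of each token, mimicking duplicate removal."""
    return list(dict.fromkeys(tokens))


tokens = "to be or not to be".split()
deduped = unique_tokens(tokens)
# After deduplication every surviving term occurs exactly once.
term_freqs = {t: deduped.count(t) for t in deduped}
print(deduped)     # ['to', 'be', 'or', 'not']
print(term_freqs)  # {'to': 1, 'be': 1, 'or': 1, 'not': 1}
```

Note that removing duplicates also changes the number of tokens indexed for the field, so it is worth checking how this interacts with BM25 length normalization in your setup.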
If I set a field's similarity function to something like BM25, does More Like This use that similarity score to pick the top words for its disjunctive boolean search (or is the default tf-idf used, as suggested by the docs: [MLT] selects the top K terms with highest tf-idf to form a disjunctive query of these terms.)? And separately, does the returned order/score reflect the default tf-idf similarity or the one that I set?
When I search for a particular term in my index, results with fewer occurrences of the searched term are ranked above results with more occurrences of it.
Is there any way to score documents based on the term frequency alone, and not the inverse document frequency?
Yes, it is possible. The scoring algorithm depends on the type of query; usually it is indeed TF-IDF, but you can use script scoring: you write a simple script that determines how the score should be calculated. In the script, you return the document field that represents the term frequency.
You can find more info on how to do that here.
I'm trying to determine the similarity between two documents using Carrot2. Is it possible to get this similarity directly from the framework?
Additionally, I've been studying the tf-idf matrix and realized that the rows correspond to all the stemmed words and the columns to documents. However, how can I identify which document corresponds to which column?
For example, given a list of documents, will the column order be the order of the documents in the list?
Ex:
List docs = {doc1, doc2, doc3}
and
Column 0 = doc1
Column 1 = doc2
...
Is this correct?
Carrot2 does not use the conventional notion of document-document similarity, so you won't find it there. You can indeed use the term-document matrix to compute all sorts of document-document similarity.
You are correct in assuming that the columns of the term-document matrix are in the same order as the documents in the input list. You can check the source code to clear any other doubts.
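To illustrate computing a document-document similarity from a term-document matrix, here is a minimal sketch using cosine similarity between columns. The matrix values and the helper names (`column`, `cosine`, `doc_similarity`) are made up for the example and are not part of Carrot2; with Carrot2 you would first have to obtain the matrix from the framework yourself.

```python
import math

# Hypothetical term-document matrix: rows are (stemmed) terms, columns are documents.
# Column j corresponds to docs[j], matching the order of the input list.
docs = ["doc1", "doc2", "doc3"]
tfidf = [
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [2.0, 0.0, 4.0],
]


def column(matrix, j):
    """Extract column j (one document's term weights) as a list."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def doc_similarity(matrix, i, j):
    """Cosine similarity between the documents in columns i and j."""
    return cosine(column(matrix, i), column(matrix, j))


# doc1 and doc3 have proportional columns, so their similarity is close to 1;
# doc1 and doc2 share no terms, so their similarity is 0.
print(doc_similarity(tfidf, 0, 2))
print(doc_similarity(tfidf, 0, 1))
```

Cosine similarity is only one option; any other vector similarity can be applied to the columns in the same way.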