BM25 similarity with binary term frequencies in Elasticsearch - elasticsearch

Has anybody tried to customize the BM25 similarity used in Elasticsearch in the following way?
This is the common BM25 score. I want term frequencies to be binary (0 if a term is not present in the document and 1 if its term frequency in the document is greater than 0). So in the picture below I want tf(q_i, d) to take values in {0, 1}.
Any ideas on the easiest way to achieve this in Elasticsearch?
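For reference, assuming the picture showed the textbook BM25 formula, the score and its binary-tf variant can be written as follows. The symbols k_1, b, |d| and avgdl (the two free parameters, the document length, and the average document length) are the usual ones and are not named in the question; the exact constants in Elasticsearch's implementation may differ.

```latex
\mathrm{score}(d, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\,
  \frac{tf(q_i, d)\,(k_1 + 1)}
       {tf(q_i, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}

% With binary term frequencies, every matching term has tf(q_i, d) = 1:
\mathrm{score}_{\mathrm{bin}}(d, Q) = \sum_{q_i \in d} \mathrm{IDF}(q_i)\,
  \frac{k_1 + 1}
       {1 + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}
```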

One way to achieve this is to use the Unique Token Filter which will index only unique tokens during analysis.
This should be equivalent to having a term frequency of 1 in the document if a token exists.
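A minimal sketch of what such an index definition might look like, written as a Python dict that can be serialized to the JSON request body. The names `dedupe_tokens`, `binary_tf_analyzer` and the field `body` are made up for illustration, and the typeless `mappings` layout assumes Elasticsearch 7.x or later.

```python
import json

# Hypothetical names: "dedupe_tokens", "binary_tf_analyzer" and the
# field "body" are illustrative, not taken from the question.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # The built-in "unique" token filter drops repeated tokens,
                # so each term is indexed at most once per field value.
                "dedupe_tokens": {"type": "unique"}
            },
            "analyzer": {
                "binary_tf_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "dedupe_tokens"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "body": {
                "type": "text",
                "analyzer": "binary_tf_analyzer",
                "similarity": "BM25",
            }
        }
    },
}

# Body to send when creating the index (e.g. PUT /my-index).
print(json.dumps(index_body, indent=2))
```

One thing to keep in mind: removing duplicate tokens probably also shortens the field length that BM25 uses for length normalization, so the resulting scores may differ slightly from a formula that keeps the original document length.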

Related

What similarity measure does More Like This (mlt) score words with in ElasticSearch?

If I set a field's similarity function to be something like BM25, does More Like This use that similarity score to pick top words for its disjunctive boolean search (or is the default tf-idf used, as suggested by the docs: [MLT] selects the top K terms with highest tf-idf to form a disjunctive query of these terms.)? And separately, does the returned order/score reflect the default tf-idf similarity or the one that I set?

Score documents by term frequency alone and not inverse document frequency in elasticsearch

When I search for a particular term in my index, results with fewer occurrences of the searched term are ranked above results with more occurrences of it.
Is there any way to score documents based on term frequency alone, ignoring inverse document frequency?
Yes, it is possible. The scoring algorithm depends on the type of query; usually it is indeed TF-IDF, but you can use script scoring: you write a script that determines how the score should be calculated. In the script, you return the field of the document that holds the term frequency.
You can find more information on how to do that here.
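As a rough sketch of that idea, the request body below (built as a Python dict) wraps a match query in a `script_score` query and returns a stored numeric field as the score. It assumes Elasticsearch 7.x or later, where the `script_score` query is available, and it assumes the term frequency was precomputed at index time into a numeric field; the field names `body` and `term_count` are hypothetical.

```python
import json

# Hypothetical fields: "body" is the searched text field and "term_count"
# is a numeric field assumed to hold a precomputed term frequency.
query_body = {
    "query": {
        "script_score": {
            # Only documents matching this inner query are scored.
            "query": {"match": {"body": "elasticsearch"}},
            # The script's return value replaces the relevance score,
            # so inverse document frequency plays no role.
            "script": {"source": "doc['term_count'].value"},
        }
    }
}

print(json.dumps(query_body, indent=2))
```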

Fuzzy string matching using Elasticsearch and returning the edit distance or similarity score

I'm trying to run a match query with fuzzy results and order the results by edit distance. However, Elasticsearch returns a relevance score (_score) that depends on term frequencies and the query. Is there any way to obtain only the edit distance from Elasticsearch? Also, will writing my own custom function to calculate the edit distance slow down the search?

Carrot2 document similarity, and how are the documents ordered in the tf-idf matrix?

I'm trying to determine the similarity between two documents using Carrot2. Is it possible to get this similarity directly from the framework?
Additionally, I've been studying the tf-idf matrix and realized that the rows correspond to the stemmed words and the columns to documents. However, how can I identify which document corresponds to which column?
For example, given a list of documents, will the column order be the order of the documents in the list?
Ex:
List docs = {doc1, doc2, doc3}
and
Column 0 = doc1
Column 1 = doc2
...
Is that right?
Carrot2 does not use the conventional notion of document-document similarity, so you won't find it there. You can indeed use the term-document matrix to compute all sorts of document-document similarity.
You are correct in assuming that the columns of the term-document matrix are in the same order as the documents in the input list. You can check the source code to clear any other doubts.

tf-idf: am I understanding it right?

I am interested in doing some document clustering, and right now I am considering using TF-IDF for this.
If I am not wrong, TF-IDF is particularly used for evaluating the relevance of a document given a query. If I do not have a particular query, how can I apply tf-idf to clustering?
For document clustering, a standard approach is the k-means algorithm. If you know how many types of documents you have, you know what k is.
To make it work on documents:
a) Choose k initial documents at random as the cluster centers.
b) Assign each document to the cluster whose center is at the minimum distance from it.
c) After the documents are assigned, compute k new cluster centers by taking the centroid of each cluster, and repeat from b) until the assignments stop changing.
Now, the questions are:
a) How to calculate the distance between two documents: use the cosine similarity between the document's term vector and the cluster center's term vector (higher similarity means smaller distance). The term weights are the TF-IDF values calculated earlier for each document.
b) What the centroid should be: for a given term, the sum of its TF-IDF values over the documents in the cluster divided by the number of documents in the cluster. Do this for all terms that occur in the cluster. This gives you another n-dimensional document vector.
Hope that helps!
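The steps above can be sketched in plain Python roughly as follows. This is only an illustration under some assumptions: documents are already tokenized, TF-IDF is raw term count times log(N/df), and the initial centers are passed in explicitly (pick them at random in practice). All function names are made up for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: count * math.log(n / df[t]) for t, count in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is zero)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(members):
    """Per-term sum of TF-IDF weights divided by the number of members."""
    total = Counter()
    for vec in members:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(members) for t, w in total.items()}


def kmeans(vectors, initial_indices, iterations=10):
    """Assign each vector to the most similar center, then update centers."""
    centers = [vectors[i] for i in initial_indices]
    assignment = []
    for _ in range(iterations):
        # Step b: highest cosine similarity means smallest distance.
        assignment = [
            max(range(len(centers)), key=lambda c: cosine(vec, centers[c]))
            for vec in vectors
        ]
        # Step c: recompute each center from its current members.
        for c in range(len(centers)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centers[c] = centroid(members)
    return assignment


docs = [
    "cat dog pet".split(),
    "dog pet food".split(),
    "stock market trade".split(),
    "market trade price".split(),
]
clusters = kmeans(tfidf_vectors(docs), initial_indices=[0, 2])
print(clusters)  # [0, 0, 1, 1]
```

A fixed number of iterations is used here for simplicity; stopping once the assignments no longer change would work as well.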
Not exactly actually: tf-idf gives you the relevance of a term in a given document.
So you can perfectly use it for your clustering by computing a proximity which would be something like
proximity(document_i, document_j) = sum(tf_idf(t,i) * tf_idf(t,j))
for each term t both in doc i and doc j.
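A small sketch of this proximity, assuming each document is represented as a dict mapping terms to precomputed tf-idf weights (the example weights are invented):

```python
def proximity(tfidf_i, tfidf_j):
    """Sum of tf-idf products over the terms present in both documents."""
    shared_terms = tfidf_i.keys() & tfidf_j.keys()
    return sum(tfidf_i[t] * tfidf_j[t] for t in shared_terms)


# Made-up tf-idf weights for two documents; only "dog" is shared.
doc_i = {"cat": 0.5, "dog": 1.0}
doc_j = {"dog": 2.0, "fish": 3.0}
print(proximity(doc_i, doc_j))  # 2.0
```

This is an unnormalized dot product, so longer documents tend to get larger values; dividing by the two vector norms would turn it into cosine similarity.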
TF-IDF serves a different purpose; unless you intend to reinvent the wheel, you are better off using a tool like Carrot. Googling for document clustering will turn up many algorithms if you wish to implement one on your own.
