Indexing several vectors per document in Elasticsearch - elasticsearch

I am using Elasticsearch to compute the cosine similarity between paragraphs and search queries. However, the tutorials I find online seem to indicate that you can only have one vector per indexed document.. That is unfortunate for my use case since each document contains multiple paragraphs, and thus have multiple vectors associated with them.
So, when using similarity metrics like cosine similarity or k-nearest neighbors, what is the best way to deal with this? Do I add multiple JSONs per document to the index? That is, one for each vector?
Or, is there a smarter way to do this?

Related

ElasticSearch / Lucene: automata for the dictionary of terms

Issue: I need to highlight matched terms. Out-of-the-box solution cannot be applied due to the fact we don't keep sources inside ES.
Possible solution:
Retrieve ids from ES by search query
Retrieve sources by ids
Match source with query word by word using LevinsteinDistance algorithm or lucene FSM class
Considering we don't retrieve a lot of content at a time it should not consume a lot of time.
The question is the following:
Does Lucene library contain FSM/automata to represent a dictionary? The desired solution: to get lucene automata representing the dictionary and feed the query to it term by term. Automata should accept terms that are contained in the dictionary. Edit Distance should be considered as well.
Searching for the solution I found lucene classes like LevenshteinAutomata and FuzzyQuery. But LevenshteinAutomata (as I understood) represents only one term. So for several terms I need several automata.

Is there a way to show at what percentage a selected document is similar to others on ElasticSearch?

I need to build a search engine using Elasticsearch and the steps will be as following:
Search on the search engine with a search string.
The relevant results will display and I can click on these documents.
If I select a document, I will be redirected to another page where I will see all the details of the documents and will have an option "More Like This" (which will return documents similar to the selected document). I know that this is done using the MLT query.
Now my question is: Except for returning documents similar to the selected one, how can I also return at what percentage the documents are similar to the selected one?
There are a couple of things you can do.
using function_score query
more_like_this query is essentially a full text search, and it returns documents ordered by their relevance score. It could be possible to convert the score directly to a percentage, but it is not advised (here
and more specifically here).
Instead one can define a custom score with help of a function_score query, which can be designed so it returns a meaningful percentage.
This, of course, comes with additional cost of complexity, and the definition of "similarity" becomes more of an art than of science.
using dense_vector
One may opt to use the (yet experimental) dense_vector data type, which allows storing and comparing dense vectors (that is, arrays of numbers of fixed size). Here's an article that describes this approach very well: Text similarity search with vector fields.
In this case the definition of similarity is as precise as it can possibly be: a distance of two vectors in a multidimensional space, which can be computed via, for instance, cosine similarity.
However, such dense vectors have to be somehow computed, and the quality of said vectors will equal the quality of the similarity itself.
As the bottom line I must say that to make this work with Elasticsearch a bunch of computation and logic should be added outside, either in form of pre-computed models, or custom curated scoring algorithms. Elasticsearch out of the box does not seem to be a good percentage-similarity kind of deal.
Hope that helps!
If you're going the route of using semantic search via dense_vector, as Nikolay mentioned, I would recommend NBoost. NBoost has a good out-of-the-box systems for improving Elasticsearch results with SOTA models.

Computing Document Similarity Matrices in Sphinx?

Does Sphinx provide a way to precompute document similarity matrices? I have looked at Sphinx/Solr/Lucene; it seems Lucene is able to do this indirectly using Term Vectors Computing Document Similarity with Term Vectors.
Currently I am using the tf-idf-similarity gem to do these calculations, but it is incredibly slow as the dataset grows; something like On^(n-1!).
Currently trying to find a faster alternative to this. Lucene seems like a potential solution, but it doesn't have as much support within the Ruby community so if Sphinx has a good way of doing this that would be ideal.
Just to clarify; I am not trying to do live search similarity matching, which appears to be the most common use case for both Lucene and Sphinx, I am trying to precompute a similarity matrix that will create a similarity between all documents with the dataset. This will subsequently be used in data visualizations for different types of user analysis.
Also anyone with prior experience doing this I'm curious about benchmarks. How it looks in terms of time to process and how much computing power and/or parallelization were you using with reference to number of documents and doc size average.
Currently it is taking about 40 minutes for me to process roughly 4000 documents and about 2 hours to process 6400 records. Providing the 2 different sizes and times here to give an indication of the growth expansion so you can see how slow this would become with significantly large datasets.

Score equivalence Oracle Text / Lucene

Is there an equivalence between the scores an Oracle Text Score would calculate and a Lucene one ?
Would you be able to mix the sources to get one unified resultset through the score ?
Scores are not comparable between queries or data changes in Lucene, much less being comparable to another technology. Lucene scores of the same document can be changed dramatically by having other documents added or removed from the index. Scoring as a percentage of maximum becomes the obvious solution, but the same problems remain, as well as that other algorithms in another technology will ikely render different distribution. You can read about why you should not compare scores like this here and here
A way I managed to lash something similar together was to fetch matches from the other data source, and create a temporary index in a RAMDirectory, and then search again incorporating it with a MultiSearcher. That way everything is getting scored on a single, cohesive data set, within a single search. Scoring should be reasonable enough, though this isn't exactly the most efficient way to search.

Efficient Phrase Matching Algorithm

I have a set of about 7 Million phrases to be matched to about 300 Million queries.
Queries can be sub-strings or contain the phrases themselves. Basically I want a measure of 'similarity' between two phrases [ not necessarily the edit distance ]
Can someone give some pointers to efficient algorithms to do this. I would prefer distributed algorithms since I am going to do this on Hadoop via streaming using python.
Bed trees look interesting
Bed-Tree: An All-Purpose Index Structure for String Similarity Search Based on Edit Distance (Pdf of presentation)
This a at least not very trivial, because you have on the one side very much data and on the other side even more.
The ultra simplest approach would be a lucene index on the 7 mio. phrases and let the hadoop job query the index. Not quite sure if you need a solr server for that, or any similar implementations in python.
The mapper should write out the phrase id or linenumber, whatever you have to indentify it. Or at least the phrase itself, along with the matchingscore.
In the reduce step you could go for a reducing on a phrase key and write out all the related phrases with the score. (or whatever you want to)
For similarity you can read further here:
Similarity of Apache Lucene
Apache Lucene itself

Resources