Alternatives for real-time score by popularity with Elasticsearch

I would like to boost a document's score by popularity, and I'd like it to be as real-time as possible.
In order to meet the real-time requirement, it seems I have to re-index each document every time its popularity changes (per view). This seems highly inefficient.
An alternative is to run a batch process that periodically re-indexes documents that have been recently viewed, but this becomes less real-time, and still requires re-indexing entire documents when only one field (the popularity) has changed.
A third approach (which we have implemented) is to use a plugin to grab a document's popularity from an external source and use a script to include it in scoring. This works as well, but slows down search for large document spaces. Using rescore helps, but it only allows us to sort a subset of the documents returned.
Is there a better option (a way to add popularity to the index without reindexing the entire document, or a better way to integrate external data with Elasticsearch)?

You can try the following to have a real-time popularity field:
1. Include a popularity field as part of your index.
2. Increment popularity every time a document is retrieved. You can do this using partial update scripts.
3. Use a function score query to boost the document.
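For step 2, a partial update by script might look something like this (a sketch only; the index name, document id, and field name are assumptions, and the _update endpoint shown is the 7.x form):

POST /my-index/_update/mydocumentid
{
  "script": {
    "source": "ctx._source.popularityScore += params.increment",
    "params": { "increment": 1 }
  }
}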
Java API (for the function score query in step 3):
// assumes static imports of QueryBuilders.matchQuery and ScoreFunctionBuilders.fieldValueFactorFunction
new FunctionScoreQueryBuilder(
        matchQuery("canonical_name", phrase)
            .analyzer("standard")
            .minimumShouldMatch("100%"))
    .add(fieldValueFactorFunction("popularityScore")
            .modifier(Modifier.LOG1P)
            .factor(2f))
    .boostMode("sum");
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/boosting-by-popularity.html

We implemented a hybrid of your second and third approach. We had an external source (in our case a DB) that stored popularity values per doc ID, and all queries regarding popularity were served from there. Additionally, we had a cron job that updated all documents every hour by reindexing. The reason we reindexed was that other analysis done on the document needed the new popularity, but technically you could rely on the DB alone, since it serves all popularity requests.
DBs are generally faster than Elasticsearch/Lucene/Solr when it comes to retrieving a number for a doc ID. Hope this helps.

I know this is an old question, but Elasticsearch has released an experimental feature where you can provide ranks per document in the search query:
https://www.elastic.co/blog/made-to-measure-how-to-use-the-ranking-evaluation-api-in-elasticsearch
Basically, if you believe that some documents will be returned by a certain search query, you can provide those documents (their ids) along with a rank per document in the search query. If a provided document id is within the search result, its rank will be used to boost it.
Since you have to provide an array of document ids and their ranks in the search query, you need some way to determine (beforehand) if these documents are expected in the search result.
This feature just seems the wrong way around at first, since you need to figure out potential results before you execute the actual search. But maybe it's something. It's real time at least.
https://www.elastic.co/guide/en/elasticsearch/reference/6.7/search-rank-eval.html
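For reference, a rank-eval request that supplies document ids with per-document ratings looks roughly like this (a sketch; the index name, document ids, query, and metric are placeholders):

GET /my-index/_rank_eval
{
  "requests": [
    {
      "id": "popularity_query",
      "request": { "query": { "match": { "title": "elasticsearch" } } },
      "ratings": [
        { "_index": "my-index", "_id": "doc-1", "rating": 3 },
        { "_index": "my-index", "_id": "doc-2", "rating": 1 }
      ]
    }
  ],
  "metric": { "precision": { "k": 10, "relevant_rating_threshold": 1 } }
}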

Related

Does reading an elastic document by _id count as a search for the `refresh_interval`

In the write tuning section, Elastic recommends increasing the refresh interval.
We're doing document ingestion, and during ingestion we may do reads, essentially like
GET /my-index/_doc/mydocumentid
that is, a read of the document by its _id, as opposed to a search. Some descriptions suggest that the document id is just added to the Lucene index like other attributes. Does this mean that the read by id would still force a refresh (effectively resetting the refresh_interval) instead of allowing the index to wait for the full refresh_interval?
This is actually a tricky one:
You are correct that a GET on an _id works right away (unlike a multi-document operation like a search, which needs to wait for an explicit ?refresh from you or for the refresh_interval). But the underlying implementation changed twice:
Initially the GET on an _id read the data right from the translog, so it didn't need a refresh / the creation of a segment.
The code was complex, so in 5.0 we changed it so that the data would be read from a segment, but a GET on an _id would automatically trigger a _refresh. So it looked the same from the outside and the code was simpler.
But for use-cases that did a lot of GETs by _id this was expensive, since it creates lots of tiny segments. So in 7.6 we changed it back to reading from the translog.
So if you are using a current version, it doesn't trigger a _refresh.
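To make the distinction concrete (reusing the index and id from the question): the first request below is a real-time GET that current versions serve from the translog, while the second is a search and therefore only sees data after a refresh, even though it matches on nothing but the _id.

GET /my-index/_doc/mydocumentid

GET /my-index/_search
{
  "query": { "ids": { "values": [ "mydocumentid" ] } }
}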
A GET on the _id is not a search, so no.

Check if document is part of Elasticsearch query?

Curious if there is some way to check if document ID is part of a large (million+ results) Elasticsearch query/filter.
Essentially I'll have a group of related document IDs and only want to return them if they are part of a larger query. Hoping to do this on the database side. Theoretically it seemed possible, since ES has to cache stuff related to large scrolls.
It's an interesting use-case, but you need to understand that Elasticsearch (ES) doesn't return all the matching document IDs in the search result; by default it returns only the top 10 documents in the response, which can be changed with the size parameter.
And if you increase the size param and have millions of matching docs in your query, ES query performance will be very bad, and it might even bring the entire cluster down if you frequently fire such queries (in the absence of a circuit breaker), so be cautious about it.
You are right that ES caches things, but again, if you try to cache a huge amount of data that gets invalidated very frequently, you will not get the required performance benefits, so better to benchmark it.
You are already on the correct path with the scroll API to iterate over millions of search results; just see the points below to improve further.
First get the count of search results; this is included in the default search response (as hits.total, with an "eq" or "gte" relation), which gives you an idea of how many results you have and lets you choose the size param for subsequent calls to check whether your ID is present or not.
See if you can effectively utilize the filter context in your query, which is cached by ES by default (see the sketch after this list).
Benchmark some of your heavier scroll API calls with your data.
Refer to this thread to fine-tune your cluster and index configuration and optimize the ES response further.
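Putting the first two points together, a count request with the query kept in a filter context might look like this (a sketch; the index, field, and value are placeholders):

GET /my-index/_count
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "status": "published" } }
      ]
    }
  }
}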

ElasticSearch: given a document and a query, what is the relevance score?

Once a query is executed on ElasticSearch, a relevance _score is calculated for each retrieved document.
Given a specific document (e.g. by doc ID) and a specific query, I would like to see what is its _score?
One way is perhaps to query ES, retrieve all the hit documents, and look up the desired document out of all the retrieved documents to see its score.
I assume there should be a more efficient way to do this. Given a query and a document ID, what is its _score?
I'm using ElasticSearch 7.x
PS: I need this for a learning-to-rank scenario (to create my judgment list). I have a complex query built from various should and must clauses over different fields. My main requirement is to get the score for each individual sub-query, which there seems to be no solution for. I want to understand which parts of this complex query are more useful and which are less. The only way I've come up with is to execute each sub-query separately to get the score, but I don't want to actually execute that query; I just want to ask what the score of a specific document is for that sub-query.
Scoring of a document is related not just to the document and all other documents in the index; it also depends on various factors, like:
_score is calculated on a per-shard basis, not on an index basis, by default, although you can change this behavior by using the DFS Query Then Fetch param in your query. More info in this official blog.
Whether any boost is applied at index or query time (index-time boosting is deprecated since 5.x).
Whether any custom scoring function is used in addition to the default ES scoring algorithm (TF/IDF in old versions, BM25 in the latest versions).
Edit: based on comments from other respected community members, rephrasing the statement below:
To answer your question: using the _explain API, you can understand how Elasticsearch computes a score explanation for a query and a specific document. This can give useful feedback on whether a document matches or doesn't match a specific query.
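For example, on 7.x a request like the following returns the full score breakdown for one document (a sketch; the index name, document id, and query are placeholders, and in practice you would pass your real complex query):

GET /my-index/_explain/my-doc-id
{
  "query": {
    "match": { "title": "elasticsearch" }
  }
}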

How does ElasticSearch handle an index with 230m entries?

I was looking through Elasticsearch and noticed that you can create an index and bulk add items. I currently have a series of flat files with 220 million entries. I am working on Logstash to parse them and add them to Elasticsearch, but I feel that having them all under 1 index would be rough to query. The row data is nothing more than 1-3 properties at most.
How does Elasticsearch function in this case? In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
I have been walking through the documentation, and it explains what to do, but not always why it does what it does.
In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
That is exactly what you need to do. Typically it's an iterative process:
start by putting a subset of the data in. You can also put in all the data, if time and cost permit.
put some search load on it that is as close as possible to production conditions, e.g. by turning on whatever search integration you're planning to use. If you're planning to only issue queries manually, now's the time to try them and gauge their speed and the relevance of the results.
see if the queries are particularly slow and if their results are relevant enough. If not, change the index mappings or the queries you're using to achieve faster and more relevant results, and indeed add more nodes to your cluster.
Since you mention Logstash, there are a few things that may help further:
check out Filebeat for indexing the data on an ongoing basis. You may not need to do the work of reading the files and bulk indexing yourself.
if it's log or log-like data and you're mostly interested in more recent results, it could be a lot faster to split up the data by date & time (e.g. index-2019-08-11, index-2019-08-12, index-2019-08-13). See the Index Lifecycle Management feature for automating this.
try using the Keyword field type where appropriate in your mappings. It stops analysis on the field, preventing you from doing full-text searches inside the field and only allowing exact string matches. Useful for fields like a "tags" field or a "status" field with something like ["draft", "review", "published"] values.
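As an illustration, a mapping that uses keyword for such fields might look like this (a sketch; the index and field names are assumptions):

PUT /index-2019-08-11
{
  "mappings": {
    "properties": {
      "status":  { "type": "keyword" },
      "tags":    { "type": "keyword" },
      "message": { "type": "text" }
    }
  }
}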
Good luck!

Data model for fields that change frequently in ElasticSearch

What is the best way to deal with fields that change frequently inside a document for ElasticSearch? Per their docs about partial updates...
Internally, however, the update API simply manages the same retrieve-change-reindex process that we have already described.
In particular, what should be done when indexing the document is likely to be expensive, given the number of indexed fields and the size of some of the text fields that have to be analyzed?
As a concrete example, use SO's view and vote counts on questions and answers. It would seem expensive to reindex the text body just to update those values.
Maybe you shouldn't update so frequently. Perhaps things like votes/views should only be updated in ES periodically, while more critical fields like answers/questions are pushed immediately. Consider what's most important and see if you can get away with some level of staleness.
ElasticSearch is great for text search, but I would not consider ES to support SO in its entirety (or similar applications). It could be a useful tool for searching for answers/questions on SO, or for internal applications (like log/event analysis). But perhaps the actual serving of data could be better done with a different solution? Maybe it should be powered by Cassandra instead for the bulk of the work? You get the idea...
If you want to use ES as a solution to your needs, and you MUST update frequently, you could definitely consider the parent/child model mentioned already. Of course, that method will require more memory/disk space, and it will take up more CPU/time when you query for totals. An alternative would be to have the parent store the searchable fields and let the child hold the metadata (where the child's fields are not analyzed). This will allow you to make frequent updates without having to undergo an expensive re-index, since there is nothing that needs to be analyzed.
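On recent versions the parent/child split is modeled with a join field; a sketch of such a mapping follows (the index, field, and relation names are assumptions, and older versions used separate mapping types instead):

PUT /posts
{
  "mappings": {
    "properties": {
      "post_to_counters": { "type": "join", "relations": { "post": "counters" } },
      "body":             { "type": "text" },
      "views":            { "type": "integer" },
      "votes":            { "type": "integer" }
    }
  }
}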
You could also consider what I mentioned above and see if you can get away with some staleness. This can be done in many ways too. You can throttle your requests by type of change, or change the refresh/flush interval, or consider de-duping updates if you are sending updates in bulk. These too have their shortcomings...
I think the best way to handle the change is to split the document (you can use a parent/child relationship, or just store a parent ID) and make the document as small as possible (moving the changeable parts into new types).
This can be a way to accomplish your requirement, taking SO as the example:
You can use multiple types for this; consider a post with its views and vote count.
Create a type for post, view and vote.
For a post, index a document to the post type (with the post ID, title, description, and tags). For every view of that post, index a document to the view type (with the ID of the post), and if it is voted on, index a document to the vote type (with the number of votes, the ID of the post, and any other info you need, like a positive/negative flag).
So, to get the views for a post, filter on the post ID and get the document count in the view type.
To get the number of votes, use a stats aggregation, or a terms aggregation followed by a stats aggregation to get positive and negative votes separately.
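A sketch of that vote aggregation (the index, field names, and post ID here are assumptions):

GET /vote/_search
{
  "size": 0,
  "query": { "term": { "post_id": "12345" } },
  "aggs": {
    "votes_by_flag": {
      "terms": { "field": "positive" },
      "aggs": {
        "vote_stats": { "stats": { "field": "no_of_votes" } }
      }
    }
  }
}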
This is the way I think is best, but there can be other opinions too.
Thanks
What I do is use a database like Mongo or MySQL for storing properties that get updated frequently, and use Elasticsearch to store documents for text searching.
Example: I want to keep data about a book and its contents, and I also want to keep the total number of views; updating and reindexing the document each time a user views it is total overkill.
