How to periodically update ElasticSearch index? UPDATE or REBUILD?

Suppose there are a fixed 1 million article titles to be indexed, and each article has two growing fields with "type": "rank_feature": view_count and favor_count. I have to update the counter values every hour in order to boost hot articles in search results.
Since the UPDATE operation in ES and Lucene is equivalent to search-delete-create, I wonder what the proper solution is in my case. Does the UPDATE operation save the unnecessary ANALYZE steps for those fixed titles?

An update does not make your analysis more efficient — it still has to process the entire document again.
If you have 2 fields that change frequently and other fields that are more static, I'd restructure the documents to use parent/child:
The parent contains the static fields
The child has your 2 frequently changing fields
That way you can avoid the (re) analysis of documents as much as possible. This comes at the cost of some overhead at search-time, but should be manageable if you only have a single child.
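One way to sketch that restructuring, as the request bodies you would send to the cluster. The join field name (doc_relation), the relation names (article/counters), and the index layout are all illustrative, not from the question:

```python
# Sketch: a join mapping that keeps the static, analyzed title on the
# parent and the two rank_feature counters on a small child document.
parent_child_mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "view_count": {"type": "rank_feature"},
            "favor_count": {"type": "rank_feature"},
            "doc_relation": {
                "type": "join",
                "relations": {"article": "counters"},
            },
        }
    }
}

def counter_child(article_id: int, views: int, favors: int) -> dict:
    # Hourly counter refresh: only this tiny child document is re-indexed
    # (routed to the parent's shard, as the join type requires), so the
    # analyzed title on the parent is never touched.
    return {
        "view_count": views,
        "favor_count": favors,
        "doc_relation": {"name": "counters", "parent": str(article_id)},
    }
```

At query time the child's rank_feature fields can then be folded into scoring with a has_child query wrapping a rank_feature query.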

Related

Elastic Search Number of Document Views

I have a web app that is used to search and view documents in Elastic Search.
The goal now is to maintain two values.
1. How many times the document was fetched in total (life time views)
2. How many times the document was fetched in last 30 days.
Achieving the first is somewhat possible, but the second one seems to be a very hard problem.
The two values need to be part of the document as they will be used for sorting the results.
What is the best way to achieve this?
To maintain expiring data like that you will need to store each view with its timestamp. I suppose you could store them in an array in the ES document, but you're asking for trouble doing it like that: the update operation you'd need to call every time the document is viewed has to delete and recreate the document (that's how ES does updates), and if two views happen at the same time it will be difficult to make sure both get stored.
There are two ways to store the views, and make use of them in the query:
Put them in a separate store (could be a different index in ES if you like), and run a cron job or similar every day to update every item in the main index with the number of views from the last thirty days in the view store. Even with a lot of data it should be possible to make this quite efficient, depending on your choice of store for views.
Use the ElasticSearch parent/child datatype to store views in the same index as the main documents, as children. I'm not sure that I'd particularly recommend this approach, but I think it should be possible with aggregations to write a query that sorts primary documents by the number of children (filtered by date). It might be quite slow though.
I doubt there is any other way to do this with current versions of ES, because it doesn't support joining across indices. Either the data must be aggregated in advance onto the document, or it has to be available in the same index.
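The first approach above can be sketched as a small daily job. The event shape, the views_30d field name, and the "documents" index name are illustrative assumptions; the bulk body follows the Elasticsearch _bulk API's action-line/partial-document format:

```python
from collections import Counter
from datetime import datetime, timedelta

# Views live in a separate store as (doc_id, timestamp) events; a daily
# cron folds the trailing-30-day counts back onto the main documents
# as partial updates.

def thirty_day_counts(view_events, now):
    cutoff = now - timedelta(days=30)
    return Counter(doc_id for doc_id, ts in view_events if ts >= cutoff)

def bulk_update_lines(counts, index="documents"):
    # Each update is an action line followed by a partial document,
    # ready to be newline-joined (NDJSON) and POSTed to /_bulk.
    lines = []
    for doc_id, n in counts.items():
        lines.append({"update": {"_index": index, "_id": doc_id}})
        lines.append({"doc": {"views_30d": n}})
    return lines
```

Because only the views_30d field is sent as a partial update, the job's cost scales with the number of viewed documents, not the index size.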

Elasticsearch: Does having collapsed documents on the same shard improve performance while collapsing?

Elasticsearch's parent/child relationship imposes having the parent and children on the same shard by using the _routing field during ingestion.
I was wondering if using the same approach would provide a performance improvement when using the collapse feature of Elasticsearch, or whether it would make it worse.
If we look at both cases:
1) Routing to the same shard: the shard is able to do the collapsing on its own and return already fully collapsed documents.
2) Documents are spread across many shards: the collapsing can only happen later, with all shards returning lots of documents that are collapsed afterwards.
I do not know if Elasticsearch will do the 2nd even when the documents are on the same shard.
Thanks.
The full genesis of field collapsing (introduced in ES 5.3) can be found in PR 22337 (issue 21833).
Initially, the idea was to create a new top_groups aggregation, modeled after a terms+top_hits combo, but in the end it was deemed too costly to implement and not necessarily optimal.
Field collapsing was finally implemented in the search layer, because it can benefit from the existing query/fetch phases and requires a lot less memory than doing it as an aggregation. Pagination also works out of the box.
It was discussed whether it would be a good idea to use the grouping field as a routing key to make sure all top hits were located on the same shard, but in the end this was deemed too big a limitation.
So, long story short, with field collapsing there is no such restriction to locate all documents on the same shard because the fetch request (phase 2) will be sent to all shards anyway.
As always, the best way is to try it out for yourself and measure the performance.
1 index with 1 shard (with and without routing key)
1 index with several shards (with and without routing key)
My take is that it would make no big difference, because only the top hits are collapsed and a normal search query (without field collapsing) would go through both query/fetch phases as well anyway.

How does ElasticSearch handle an index with 230m entries?

I was looking through Elasticsearch and noticed that you can create an index and bulk add items. I currently have a series of flat files with 220 million entries. I am working on Logstash to parse them and add them to Elasticsearch, but I feel that having it all under 1 index would make it rough to query. The row data is nothing more than 1-3 properties at most.
How does Elasticsearch function in this case? In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
I have been walking through the documentation, and it is explaining what to do, but not necessarily all the time explaining why it does what it does.
In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
That is exactly what you need to do. Typically it's an iterative process:
start by putting a subset of the data in. You can also put in all the data, if time and cost permit.
put some search load on it that is as close as possible to production conditions, e.g. by turning on whatever search integration you're planning to use. If you're planning to only issue queries manually, now's the time to try them and gauge their speed and the relevance of the results.
see if the queries are particularly slow and if their results are relevant enough. You can change the index mappings or the queries you're using to achieve faster results, and, if needed, add more nodes to your cluster.
Since you mention Logstash, there are a few things that may help further:
check out Filebeat for indexing the data on an ongoing basis. You may not need to do the work of reading the files and bulk indexing yourself.
if it's log or log-like data and you're mostly interested in more recent results, it could be a lot faster to split up the data by date & time (e.g. index-2019-08-11, index-2019-08-12, index-2019-08-13). See the Index Lifecycle Management feature for automating this.
try using the Keyword field type where appropriate in your mappings. It stops analysis on the field, preventing you from doing full-text searches inside the field and only allowing exact string matches. Useful for fields like a "tags" field or a "status" field with something like ["draft", "review", "published"] values.
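The last two suggestions can be sketched as follows. The field names (message, status, tags) and the index prefix are illustrative assumptions:

```python
from datetime import date

# A mapping that analyzes free text but keeps exact-match fields as
# keyword, so they are filterable/aggregatable without analysis overhead.
status_mapping = {
    "mappings": {
        "properties": {
            "message": {"type": "text"},     # analyzed, full-text searchable
            "status": {"type": "keyword"},   # exact string matches only
            "tags": {"type": "keyword"},
        }
    }
}

def daily_index_name(prefix: str, day: date) -> str:
    # One index per day (e.g. index-2019-08-11); Index Lifecycle
    # Management can then roll over and expire old indices automatically.
    return f"{prefix}-{day.isoformat()}"
```

Queries over recent data then only touch the last few daily indices instead of the whole 220-million-row set.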
Good luck!

ElasticSearch Frequent Updates

We have a rather difficult set of requirements for our search engine replacement and they go as follows.
Every instance will have a unique schema, we have multiple client installations that we don't control that have varying data structures
Frequent updates, it's not uncommon for every record to have a field be updated in a single action. Some fields are updated frequently, others are never changed
Some of our fields can be very large (50mb+) though these are never changed and are rare in a data set.
We'd like to have near real-time search if possible
We're looking at making the fields that are updated semi-frequently/frequently into child documents. The issue with this is that we have a set of tags that change quite frequently on the record that we want to search against in near real time. There is a strong expectation in our application that when this data is modified that searching immediately reflect that. We've tried child documents, but they don't seem to update as quickly as we'd like over a large data set.
So the questions are as follows:
Are there strategies I'm not aware of for updating child documents quickly? Maybe a plugin? Right now we're only using the RESTful interface.
Would it be better to store the data that isn't frequently changed in ES but keep the tags in a database? Possibly creating a plugin in ES that maps the two together? Would this plugin be difficult? Ideally, we'd be able to mix our searches together (tags + regular ES queries) in a boolean fashion, including the tags stored in a table.
Hopefully this will be helpful to other people in this situation, here is the solution I came up with.
Use Child/Parent documents
There was a single parent that contained static information for the record that rarely/never changes (bulk of the data indexed)
Child documents were created for the other data I wanted to index, so it could be indexed independently of the primary document
Since I had split the record data into static and non-static documents, and then broken the non-static data into further child documents, I was able to build a high-throughput indexer. The total set of records to be indexed was split into sub-chunks, which were then further split by child document type. I handed these chunks out to various indexer instances, so the number of documents indexed per second was limited only by the throughput of the data source or of the ES cluster.
This was all done through the bulk API. Keeping the static data away from the frequently changing data allowed the frequently changed data to be updated quite quickly and this speed was only limited by the available hardware. It was a little tougher to craft queries using the child document clauses and aggregates but everything seemed to work.
Notes
There is a performance penalty to using parent/child documents, which was a non-issue for us considering what ES gave us over our previous solution, but it may cause issues for other implementations.
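The split-and-bulk approach described above can be sketched as bulk-action builders. This uses the modern join-field syntax (older versions expressed parent/child via the _parent mapping), and the relation and index names ("records"/"tags") are illustrative:

```python
# Static parents and frequently-changing tag children are emitted as
# separate _bulk actions, so a tag update never re-analyzes the large
# parent document.

def parent_action(index, rec_id, static_fields):
    # Action line + source for the static parent (the bulk of the data).
    return [
        {"index": {"_index": index, "_id": rec_id}},
        {**static_fields, "relation": {"name": "record"}},
    ]

def tags_child_action(index, rec_id, tags):
    # Child is routed to the parent's shard, as parent/child requires.
    return [
        {"index": {"_index": index, "_id": f"{rec_id}-tags", "routing": rec_id}},
        {"tags": tags, "relation": {"name": "tags", "parent": rec_id}},
    ]
```

Separate indexer instances can then each drain their own chunk of parent or child actions through /_bulk independently.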

Alternatives for real time score by popularity with elasticsearch

I would like to boost a document's score by popularity. I'd like it to be as real-time as possible.
In order to meet the real-time requirement, it seems I have to re-index each document each time its popularity changes (per view). This seems highly inefficient.
An alternative is to run a batch process that periodically re-indexes documents that have been recently viewed, but this becomes less real-time, and still requires re-indexing entire documents when only one field (the popularity) has changed.
A third approach (which we have implemented) is to use a plugin to grab a document's popularity from an external source and use a script to include it in scoring. This works as well, but slows down search for large document spaces. Using rescore helps, but it only allows us to sort a subset of the documents returned.
Is there a better option (a way to add popularity to the index without reindexing the entire document or a better way to integrate external data with elastic search)?
You can try the following to have a realtime popularity field.
Include a popularity field as part of your index.
Increment popularity every time a document is retrieved. You can do this using partial update scripts.
Use function score query to boost the document.
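Steps 2 and 3 above can be sketched as REST request bodies, mirroring the Java snippet below. The popularityScore field name comes from that snippet; the script uses modern Painless syntax, and the query phrase is illustrative:

```python
# Step 2: scripted partial update, sent to POST /<index>/_update/<id>
# on each view; only the counter changes, the document is not resent.
increment_popularity = {
    "script": {
        "source": "ctx._source.popularityScore += params.by",
        "params": {"by": 1},
    }
}

# Step 3: function_score query that folds the counter into the score
# with log1p damping, added to the relevance score.
boost_by_popularity = {
    "query": {
        "function_score": {
            "query": {"match": {"canonical_name": "some phrase"}},
            "field_value_factor": {
                "field": "popularityScore",
                "modifier": "log1p",
                "factor": 2,
            },
            "boost_mode": "sum",
        }
    }
}
```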
Java API:
new FunctionScoreQueryBuilder(
        matchQuery("canonical_name", phrase)
                .analyzer("standard")
                .minimumShouldMatch("100%"))
        .add(fieldValueFactorFunction("popularityScore")
                .modifier(Modifier.LOG1P)
                .factor(2f))
        .boostMode("sum");
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/boosting-by-popularity.html
We implemented a hybrid of your second and third approaches. We had an external source (in our case a DB) that stored popularity values per doc id, and all popularity lookups were served from there. Additionally, we had a cron job that updated all documents every hour by reindexing. The reason we reindexed is that we had other analysis done on the document that needed the new popularity, but technically you could keep only the DB, as it serves all request purposes.
DBs are generally faster than Elasticsearch/Lucene/Solr when it comes to retrieving a number for a doc id. Hope this helps.
I know this is an old question, but Elasticsearch has released an experimental feature where you can provide ranks per document in the search query:
https://www.elastic.co/blog/made-to-measure-how-to-use-the-ranking-evaluation-api-in-elasticsearch
Basically, if you believe that some documents will be returned from a certain search query, you can provide those documents (their ids) along with a rank (per document) in the search query. If a provided document id is within the search result, its rank will be used to boost itself.
Since you have to provide an array of document ids and their ranks in the search query, you need some way to determine (beforehand) if these documents are expected in the search result.
This feature seems the wrong way around at first, since you need to figure out potential results before you execute the actual search. But maybe it's something. It's real-time at least.
https://www.elastic.co/guide/en/elasticsearch/reference/6.7/search-rank-eval.html
