Something "Materialized view"-like in ElasticSearch - elasticsearch

I have a query which runs every time a website is loaded. This query aggregates over three different term fields and around 3 million documents, and therefore needs 6-7 seconds to complete. The data does not change that frequently, and the freshness of the result is not critical.
I know that I can use an alias to create something "view"-like from the RDBMS world. Is it also possible to populate it so the query result gets cached? Is there any other way caching might help in this scenario, or do I have to create an additional index for the aggregated data and update it from time to time?

I know that the post is old, but regarding views: Elastic added data frames in 7.3.0.
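A minimal sketch of such a transform, pivoting the live index into a pre-aggregated destination index (the transform id, index names, field names, and aggregation here are hypothetical):

PUT _data_frame/transforms/my_cached_aggs
{
  "source": { "index": "live_index" },
  "dest": { "index": "aggregated_index" },
  "pivot": {
    "group_by": {
      "group_field": { "terms": { "field": "group_field" } }
    },
    "aggregations": {
      "avg_price": { "avg": { "field": "price" } }
    }
  }
}

POST _data_frame/transforms/my_cached_aggs/_start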
You could also use the _reindex API:
POST /_reindex
{
  "source": {
    "index": "live_index"
  },
  "dest": {
    "index": "caching_index"
  }
}
But it will not change your ingestion problem.
For that, I think the solution is sharding your index: with 2 or more shards, and several nodes, Elasticsearch will be able to parallelize.
An easier thing to test is to disable the refresh_interval while indexing and re-enable it afterwards. It generally improves ingestion time a lot.
You can see a full article on this use case at
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html
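A sketch of that refresh_interval pattern (the index name is illustrative; "-1" disables refreshes and "1s" is the default):

PUT /caching_index/_settings
{
  "index": { "refresh_interval": "-1" }
}

# ... run the heavy bulk indexing here ...

PUT /caching_index/_settings
{
  "index": { "refresh_interval": "1s" }
}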

You can create a materialized view. It is essentially a table that holds the results of aggregate functions. Since the aggregated data has already been inserted, querying it is faster, and I feel there is no need to cache on top of it. I have created MVs myself, and they improve performance tremendously. That said, you can also go with Elasticsearch, where you can cache the aggregated queries if your data is not changing frequently. I feel MVs and Elasticsearch give about the same performance.

Related

How to design a system for Search query and Csv/Pdf export for 500GB data/day?

Problem statement
One device is sending 500 GB of text data (logs) per day to my central server.
I want to design a system with which a user can:
Apply exact-match filters and go through the data using pagination
Export PDF/CSV reports for the same query as above
Data can be stored for a maximum of 6 months. It's an on-premise solution. Some delay on queries is affordable. If we can do data compression, it would be awesome. I have 512 GB RAM, an 80-core system, and TBs of storage (these are upgradable).
What I have tried/found out:
Tech stack I am planning to use: the MEAN stack for application development. For the core data part I am planning to use the ELK stack. The recommended ideal size for a single Elasticsearch index is under 40-50 GB.
So, my plan is to create 100 indexes per day, each of 5 GB, for each device. During a query I can sort these indices by name (e.g. 12_dec_2012_part_1 ...) and search each index linearly, continuing until the range the user asked for is covered. (I think this will hold up for ad-hoc requests, but for reports, writing a CSV file by going through the indices sequentially one by one will take a long time.) For reports, I think the best I can do is create a PDF/CSV for each index (5 GB), because most file viewers cannot open very large CSV/PDF files.
I am new to big data problems, and I am not sure which approach is right for this: ELK or the Hadoop ecosystem. (I would like to go with ELK.)
Can someone point me in the right direction, explain how to proceed, or share their experience if they have dealt with this type of problem statement? Unconventional solutions to these problems are also welcome.
Thanks!
exact-match filters
You can use a term query or a match_phrase query.
Returns documents that contain an exact term in a provided field.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
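A minimal term query sketch (the index and field names here are illustrative):

GET /device-logs/_search
{
  "query": {
    "term": {
      "status": {
        "value": "ERROR"
      }
    }
  }
}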
pagination
You can use the from and size parameters for pagination.
GET /_search
{
  "from": 5,
  "size": 20,
  "query": {
    "match": {
      "user.id": "kimchy"
    }
  }
}
https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html
Export PDF/CSV
You can use Kibana
Kibana provides you with several options to share Discover saved searches, dashboards, Visualize Library visualizations, and Canvas workpads.
https://www.elastic.co/guide/en/kibana/current/reporting-getting-started.html
Data can be stored for max 6 months
You can use an ILM policy.
You can configure index lifecycle management (ILM) policies to automatically manage indices according to your performance, resiliency, and retention requirements. For example, you could use ILM to: [...]
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html
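A sketch of a policy matching the 6-month retention requirement (the policy name and rollover threshold are illustrative, and rollover assumes an alias or data stream is set up):

PUT _ilm/policy/device_logs_retention
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb" }
        }
      },
      "delete": {
        "min_age": "180d",
        "actions": { "delete": {} }
      }
    }
  }
}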
optimal shard size
For log indices you can use data stream indices.
A data stream lets you store append-only time series data across multiple indices while giving you a single named resource for requests. Data streams are well-suited for logs, events, metrics, and other continuously generated data.
https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html
When you use data stream indices you don't have to think about shard size; they will roll over automatically. :)
For compression, you should update the index settings:
index.codec: best_compression
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html
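A sketch combining both suggestions in one index template (the template name and index pattern are illustrative; data_stream support in index templates requires a recent 7.x release):

PUT _index_template/device_logs_template
{
  "index_patterns": ["device-logs-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.codec": "best_compression"
    }
  }
}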

How does ElasticSearch handle an index with 230m entries?

I was looking through Elasticsearch and noticed that you can create an index and bulk-add items. I currently have a series of flat files with 220 million entries. I am working on Logstash to parse them and add them to Elasticsearch, but I feel that having them under 1 index would make querying rough. The row data is nothing more than 1-3 properties at most.
How does Elasticsearch function in this case? In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
I have been walking through the documentation, and it is explaining what to do, but not necessarily all the time explaining why it does what it does.
In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
That is exactly what you need to do. Typically it's an iterative process:
Start by putting a subset of the data in. You can also put in all the data, if time and cost permit.
Put some search load on it that is as close as possible to production conditions, e.g. by turning on whatever search integration you're planning to use. If you're planning to only issue queries manually, now's the time to try them and gauge their speed and the relevance of the results.
See if the queries are particularly slow and if their results are relevant enough. Change the index mappings or the queries you're using to achieve faster results, and, if needed, add more nodes to your cluster.
Since you mention Logstash, there are a few things that may help further:
Check out Filebeat for indexing the data on an ongoing basis. You may not need to do the work of reading the files and bulk indexing yourself.
If it's log or log-like data and you're mostly interested in more recent results, it could be a lot faster to split up the data by date and time (e.g. index-2019-08-11, index-2019-08-12, index-2019-08-13). See the Index Lifecycle Management feature for automating this.
Try using the keyword field type where appropriate in your mappings. It stops analysis on the field, preventing you from doing full-text searches inside the field and only allowing exact string matches. Useful for fields like a "tags" field or a "status" field with values like ["draft", "review", "published"]. A mapping sketch follows below.
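A minimal mapping sketch (the index and field names are illustrative):

PUT /my-index
{
  "mappings": {
    "properties": {
      "status": { "type": "keyword" },
      "tags": { "type": "keyword" },
      "message": { "type": "text" }
    }
  }
}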
Good luck!

Updating all data elasticsearch

Is there any way to update all data in Elasticsearch?
In the example below, the update is done for external document '1'.
curl -XPOST 'localhost:9200/customer/external/1/_update?pretty' -d '
{
  "doc": { "name": "Jane Doe", "age": 20 }
}'
Similarly, I need to update all my data in external. Is there any way or query to update all the data?
Updating all documents in an index means that all documents will be deleted and new ones will be indexed, which means lots of "marked-as-deleted" documents.
When you run a query, ES will automatically filter out those "marked-as-deleted" documents, which will have an impact on the response time of the query. How much impact depends on the data, the use case, and the query.
Also, if you update all documents, then unless you run a _force_merge there will be segments (especially the larger ones) that still contain "marked-as-deleted" documents, and those segments are unlikely to be merged automatically by Lucene/Elasticsearch.
My suggestion, if your indexing process is not too complex (like getting the data from a relational database and processing it before indexing into ES, for example), is to drop the index completely and index fresh data. It might be more effective than updating all the documents.
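For reference, if you do decide to update every document in place rather than reindexing from scratch, the _update_by_query API covers this (a minimal sketch; the script and field are illustrative):

POST /customer/_update_by_query
{
  "script": {
    "source": "ctx._source.age = 20",
    "lang": "painless"
  },
  "query": {
    "match_all": {}
  }
}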

Elasticsearch preference set to custom value, documents still returned from different shards

I'm having an issue with scoring: when I run the same query multiple times, the documents are not scored the same way each time. I found out that the problem is well known: it's the bouncing results issue.
A bit of context: I have multiple shards across multiple nodes (60 shards, 10 data nodes), all the nodes are running ES 2.3, and we're heavily using nested documents (the example query doesn't use them, for simplicity).
I tried to resolve it by using the preference search parameter, with a custom value. The documentation states:
A custom value will be used to guarantee that the same shards will be used for the same custom value. This can help with "jumping values" when hitting different shards in different refresh states. A sample value can be something like the web session id, or the user name.
However, when I run this query multiple times:
GET myindex/_search?preference=asfd
{
  "query": {
    "term": {
      "has_account": {
        "value": "twitter"
      }
    }
  }
}
I end up having the same documents, but with different scoring/sorting. If I enable explain, I can see that those documents are coming from different shards.
If I use preference=_primary or preference=_replica, we have the expected behavior (always the same shard, always the same scoring/sorting) but I can't query only one or the other...
I also experimented with search_type=dfs_query_then_fetch, which should generate the scoring based on the whole index, across all shards, but I still get different scoring for each run of the query.
So in short, how do I ensure the score and the sorting of the results of a query stay the same during a user's session?
Looks like my replicas went out of sync with the primaries.
No idea why, but deleting the replicas and recreating them has "fixed" the problem... I'll need to investigate why they went out of sync.
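For reference, one way to drop and recreate the replicas is through the index settings (a sketch, assuming the index originally had one replica):

PUT /myindex/_settings
{ "index": { "number_of_replicas": 0 } }

PUT /myindex/_settings
{ "index": { "number_of_replicas": 1 } }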
Edit 21/10/2016
Regarding the "preference" option not being taken into account, it's linked to the AWS zone awareness: if the preferred replica is in another zone than the client node, then the preference will be ignored.
The differences between the replicas are "normal" if you delete (or update) documents, from my understanding the deleted document count will vary between the replicas, since they're not necessarily merging segments at the same time.

Is it possible to limit a size of an Elasticsearch index?

I have an Elasticsearch instance for indexing log records. Naturally the data grows over time, and I would like to limit its size (to about 10 GB), something like a MongoDB capped collection.
I'm not interested in old log records anyway.
I haven't found any config for this, and I'm not sure that I can just remove data files.
Any suggestions?
The Elasticsearch "way" of dealing with "old" data is to create time-based indices. Meaning, for each day or each week you create an index. Index everything belonging to that day/week in that index.
You decide how many days you want to keep around and stick to that number. Let's say the data for 7 days amounts to 10 GB. On the 8th day you create the new index, as usual, then you delete the index from 8 days before.
At any time you'll have 7 indices in your cluster.
Using ttl as the other poster suggested is not recommended, because it is far more difficult and it creates additional pressure on the cluster. The ttl mechanism checks every indices.ttl.interval (60 seconds by default) for expired documents, creates bulk requests out of them, and deletes them. This means unnecessary requests hitting the cluster.
Instead, deleting an index is very easy and quick.
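A sketch of that daily rotation (the index names are illustrative):

# day 8: create the new daily index
PUT /logs-2015-06-08

# ...and delete the index from 8 days before
DELETE /logs-2015-06-01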
Take a look at this, and at how to easily manage time-based indices with Curator.
From what I remember, a capped collection in MongoDB is just a circular-buffer type of collection that removes the oldest entries when there's no more room. Unfortunately there's nothing like this out of the box in Elasticsearch; you have to add this functionality yourself, either by removing single documents (or batches of documents) using ES's API, or through the more performant way described in their documentation under "retiring data".
You can provide a per-index/type default _ttl (time to live) value as follows:
{
  "tweet" : {
    "_ttl" : { "enabled" : true, "default" : "1d" }
  }
}
You will find more detail here: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-ttl-field.html
Regards,
Alain
