Elasticsearch index strategies under high traffic

We use Elasticsearch for the real-time metrics and analytics part of our tool. Elasticsearch is very cool and fast when we query our data (statistical facets and terms facets).
But we have a problem when we try to index our hourly data. We collect metric data from other services every hour. First we collect the data from the other services and push it into RabbitMQ. But when the queue worker runs, not all of our hourly data gets indexed into ES. Usually only about 40% of the data is indexed and the rest is lost.
So, what would you suggest for indexing into ES under high traffic?

I've posted answers to other similar questions:
Ways to improve first time indexing in ElasticSearch
Performance issues using Elasticsearch as a time window storage (latter part of my answer applies)
Additionally, instead of a custom 'queue worker', have you considered using a 'river'? For more information, see:
http://www.elasticsearch.org/blog/the-river/
http://www.elasticsearch.org/guide/reference/river/
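Whichever transport you end up using (river or queue worker), a very common reason for documents silently disappearing under load is indexing them one at a time and ignoring the response. Below is a minimal sketch of what the worker could send instead via the bulk API; the index and field names are placeholders, not taken from your setup:

POST /_bulk
{ "index": { "_index": "metrics-2013-01-15" } }
{ "service": "billing", "metric": "requests", "value": 1024, "@timestamp": "2013-01-15T10:00:00Z" }
{ "index": { "_index": "metrics-2013-01-15" } }
{ "service": "billing", "metric": "errors", "value": 3, "@timestamp": "2013-01-15T10:00:00Z" }

Check the "errors" flag in the bulk response, and re-queue (rather than acknowledge) any RabbitMQ message whose items came back with a 429 rejection, so that spikes get retried instead of lost.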

Related

How to design a system for search queries and CSV/PDF export for 500GB of data per day?

Problem statement
One device is sending 500GB of text data (logs) per day to my central server.
I want to design a system with which a user can:
Apply exact-match filters and page through the data
Export PDF/CSV reports for the same query as above
Data can be stored for a maximum of 6 months. It's an on-premise solution. Some delay on queries is acceptable. If we can compress the data it would be awesome. I have 512GB of RAM, an 80-core system and TBs of storage (these are upgradable).
What I have tried/found out:
The tech stack I am planning to use: MEAN stack for application development. For the core data part I am planning to use the ELK stack. A single Elasticsearch index has an ideal size recommendation of under 40-50GB.
So my plan is to create 100 indexes per day, each of 5GB, per device. During a query I can sort these indices by name (e.g. 12_dec_2012_part_1 ...) and search each index linearly, continuing until I cover the range the user asked for. (I think this will work for ad-hoc requests, but for reports, writing a CSV file by going through the indices sequentially one by one will take a long time.) For reports I think the best I can do is create a PDF/CSV per index (5GB each), because most file viewers cannot open very large CSV/PDF files.
I am new to big data problems. I am not sure which approach is right for this: ELK or the Hadoop ecosystem. (I would like to go with ELK.)
Can someone point me in the right direction, suggest how to proceed, or share how they have dealt with this type of problem statement? Unconventional solutions to these problems are also welcome.
Thanks!
exact-match filters
You can use a term query or a match_phrase query.
Returns documents that contain an exact term in a provided field.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
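For example, a sketch of an exact-match filter (the index name and the status field are placeholders, and this assumes the field is mapped as keyword):

GET /logs-2012.12.01/_search
{
  "query": {
    "term": {
      "status": "ERROR"
    }
  }
}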
pagination
You can use the from and size parameters for pagination.
GET /_search
{
  "from": 5,
  "size": 20,
  "query": {
    "match": {
      "user.id": "kimchy"
    }
  }
}
https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html
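One caveat at this volume: from/size paging is capped by index.max_result_window (10,000 hits by default), so for deep paging the same docs page describes search_after, where you sort the results and pass the sort values of the last hit to fetch the next page. A sketch, assuming a @timestamp sort field:

GET /_search
{
  "size": 20,
  "query": {
    "match": {
      "user.id": "kimchy"
    }
  },
  "sort": [
    { "@timestamp": "asc" }
  ],
  "search_after": [ "2021-05-20T05:30:04.832Z" ]
}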
Export PDF/CSV
You can use Kibana
Kibana provides you with several options to share Discover saved
searches, dashboards, Visualize Library visualizations, and Canvas
workpads.
https://www.elastic.co/guide/en/kibana/current/reporting-getting-started.html
Data can be stored for max 6 months
You can use an ILM policy.
You can configure index lifecycle management (ILM) policies to
automatically manage indices according to your performance,
resiliency, and retention requirements. For example, you could use ILM
to:
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html
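A sketch of such a policy for the 6-month retention in the question (the policy name and rollover thresholds below are placeholders):

PUT _ilm/policy/logs-6-months
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          }
        }
      },
      "delete": {
        "min_age": "180d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}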
optimal shard size
For log indices you can use data streams.
A data stream lets you store append-only time series data across
multiple indices while giving you a single named resource for
requests. Data streams are well-suited for logs, events, metrics, and
other continuously generated data.
https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html
When you use data streams you don't have to think about shard size; the backing indices roll over automatically. :)
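To get a data stream you point an index template at it; a sketch, where the template name and pattern are placeholders and the ILM policy name refers to the sketch above:

PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-6-months"
    }
  }
}

Documents (which must carry a @timestamp field) are then appended with e.g. POST logs-device1/_doc, and Elasticsearch creates and rolls over the backing indices behind the named stream.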
For compression, you should update the index settings:
index.codec: best_compression
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html
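One detail worth flagging: index.codec is a static setting, so it is applied when an index is created (for example by adding it to the template settings sketched above) or to a closed index. A sketch for an existing index, where my-index is a placeholder:

POST /my-index/_close

PUT /my-index/_settings
{
  "index.codec": "best_compression"
}

POST /my-index/_open

Newly written segments then use the higher-compression codec; existing segments keep their old codec until they are merged.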

Implementing a popular-keywords feature in Elasticsearch

I'm using Elasticsearch on AWS EC2.
I want to implement a "today's popular keywords" feature in ES.
There are 3 indexes (place, genre, name), and I want to see today's popular keywords for the name index only.
I tried to use the ES slow log with Logstash, but the slow log writes an entry per shard.
(e.g. if the number of shards is 5, then 5 query log entries are saved.)
Is there any good and easy way to implement popular keywords in ES?
As far as I know, this is not supported by Elasticsearch and you need to build your own custom solution.
The design you mentioned, using the slow log, is not good. As you noted, the slow log is written on a per-shard basis, and even if you do extra processing and manage to merge and relate the per-shard entries back to a single search at the index level, it still would not be a good approach, because:
You would have to change the slow log configuration, and every index needs a different threshold. You can set it to 0ms to make sure all search queries end up in the slow logs, but that would take a huge amount of disk space and would hurt Elasticsearch performance.
You would have to parse the slow log in your application, and doing that at runtime would be very costly.
I think you can maintain a distributed cache in your application where you store the top searched keywords, like the leaderboard of a multiplayer game, which changes very frequently; in your case you don't even have to update this cache very often. I won't go into implementation details, but a simple hashmap with the search term as key and its count as value would solve the issue.
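If you would rather not keep that map in application memory, a variation of the same custom solution (my assumption, not something Elasticsearch provides out of the box) is to index every executed search term into a small auxiliary index, say search-terms with a keyword field term and a @timestamp, and read today's leaderboard with a terms aggregation:

GET /search-terms/_search
{
  "size": 0,
  "query": {
    "range": {
      "@timestamp": { "gte": "now/d" }
    }
  },
  "aggs": {
    "popular_keywords": {
      "terms": { "field": "term", "size": 10 }
    }
  }
}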
Hope this helps. Let me know if you have questions.

Elasticsearch API vs Spring Data vs Logstash

I am planning to use Elasticsearch for our dashboard, behind Spring Boot based REST services. After some research I see these top 3 options.
Option A:
Use the Elasticsearch Java API (from the comments it looks like this is going away)
Use the Elasticsearch Java REST client
Use spring-data-elasticsearch (planning to use ES 5.6, but that is challenging for the latest ES 6 as I don't see support for it right now)
Option B:
Or shall I use the Logstash approach and sync data between PostgreSQL and Elasticsearch using Logstash?
Which of these will be the better long-term approach for getting near-real-time data from ES in a high-load scenario?
Use case: I need to sync some data from a PostgreSQL table to Elasticsearch for my dashboard (near real time).
Updates are frequent on both the tables and ES, to maintain the current state.
Load is going to increase in a couple of weeks.
The options you listed are, in essence: should you go with a ready-to-use solution (Logstash), or should you implement your own?
Try Logstash first to see if it works for you - it'll take less time than implementing your own solution, and you can get a working setup in minutes (if it's not hundreds of tables).
If you want near real time, then you need to figure out whether it allows you to:
handle incremental updates, i.e. whether its tracking_column configuration will work for your data structure so that each run only loads updated records, not the whole table
run it at the desired frequency
and, in general, satisfy your latency requirements
If you decide to go with your own solution, keep in mind that spring-data-elasticsearch is a higher-level wrapper around the underlying Elasticsearch client. If there are latency goals, then working at the lower level (the Elasticsearch clients) may give you better control and more options to tune the pipeline.
Otherwise, the client choice will not matter as much as the data feed characteristics (volume/update frequency) and the db/ES cluster configuration.

Elasticsearch & Kibana - best practice for visualizing one year's logs

I am using Elasticsearch with Kibana to store and visualize data from my logs. I know it is customary to use Logstash, but I just use the Elasticsearch REST API and POST new documents to it.
I am trying to find best practices for how I should manage my indices, given that I have about 50k logs per day and I want to visualize sometimes weekly, sometimes monthly and sometimes yearly data. Also, I have no need for more than one node; I don't need a highly available cluster.
So I am basically trying to determine:
- How should I store my indexes: by time? Monthly? Weekly? One index for everything?
- What are the disadvantages of one huge index that contains all my data? Does it mean that the entire index is in memory?
Thank you.
I like to match indexes to the data retention policy. Daily indexes work very well for log files, so you can expire one day's worth after X days of retention.
The fewer indexes/shards you have, the less RAM is used in overhead by Elasticsearch to manage them.
The mapping for a field is frozen when the field is added to the index. With a daily index, I can update the mapping, have it take effect for the new indexes, and wait for the old ones to expire. With longer-term indexes, you'd probably need to reindex the data, which I always try to avoid.
The settings for shards and replicas are also frozen when you create the index.
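One way to wire that up (a sketch; the template name, pattern and fields are placeholders, and on older Elasticsearch versions the legacy PUT _template API plays the same role) is to keep the mapping and the shard/replica settings in an index template that matches the daily index names, so a change applies automatically to the next day's index while old indices keep whatever they were created with:

PUT _index_template/daily-logs
{
  "index_patterns": ["mylogs-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" }
      }
    }
  }
}

With a single node, number_of_replicas: 0 also keeps the cluster health green, since there is nowhere to place replica copies anyway.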
You can visualize them in Kibana regardless of how they're stored. Use the @timestamp field as your X-axis and change the "interval" to the period you want.
Using Logstash would be important if you wanted to alter your logs at all. We do a lot of normalization and creation of new fields, so it's very helpful. If that's not a requirement for you, you might also look into Filebeat, which can write directly to Elasticsearch.
Lots to consider...

Couchbase data replication to Elasticsearch

I went through the Couchbase XDCR replication documentation, but failed to understand the points below:
1. Couchbase replicates all the data in a bucket, in batches, to Elasticsearch, and Elasticsearch indexes these data for real-time statistical queries. My question is: if all the data is replicated to Elasticsearch, then in this case Elasticsearch is like a database which can hold a huge amount of data. So can we replace Couchbase with Elasticsearch?
2. How is the data, in JSON form, sent to d3.js to display statistical graphs?
All of the data is replicated to Elasticsearch, but it is not held there by default. The indexes and such are created, but the documents themselves are discarded. Elasticsearch is not a database and does not perform like one, certainly not on the level of Couchbase. Take a look at this presentation where it talks about performance and why Couchbase…
If your data is not critical, or if you have another source of truth, you can use Elasticsearch only.
Otherwise, I'd keep both Couchbase and Elasticsearch.
There is a resiliency page on the Elastic.co website which describes potential known problems: https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
My 2 cents.
