Elasticsearch runtime metrics - elasticsearch

My question is more research related.
We have Elasticsearch handling various tasks, including taking log entries from remote clients. The problem is that the clients sometimes overload Elasticsearch.
Is there a way to query ES for runtime metrics, such as the number of queries in the last n minutes? I'm hoping we can use these to throttle the client logging as load increases.

Data on the number of search and get requests per second can be obtained by querying the indices stats API.
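For example, the counters in `_stats` are cumulative, so sampling them twice yields a rate. A minimal sketch in Python; the host address and the 60-second window are assumptions, while the endpoint and field names are standard:

```python
# Sample the cluster-wide search/get counters twice and derive a rate.
import time
import requests

ES = "http://localhost:9200"  # assumed cluster address
WINDOW = 60                   # sampling window in seconds (tune to taste)

def sample():
    # Restrict the stats response to just the search and get metric groups
    total = requests.get(f"{ES}/_stats/search,get").json()["_all"]["total"]
    return total["search"]["query_total"], total["get"]["total"]

searches0, gets0 = sample()
time.sleep(WINDOW)
searches1, gets1 = sample()

print(f"~{(searches1 - searches0) / WINDOW:.1f} searches/s, "
      f"~{(gets1 - gets0) / WINDOW:.1f} gets/s")
```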
There are multiple tools that provide Elasticsearch monitoring, most of them open source. Having a look at their source code may be helpful.
Please also note that throttling requests client-side based on Elasticsearch stats may not be the optimal solution, as it is hard to coordinate across a variable number of clients. Using circuit breakers that trip on request timeouts may be more robust.
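A minimal sketch of that circuit-breaker idea on the client side; the thresholds, cool-down period, and endpoint are illustrative assumptions, not from any particular library:

```python
# Stop sending log entries for a cool-down period once requests
# start timing out, instead of querying cluster stats.
import time
import requests

COOL_DOWN = 30     # seconds to back off after the breaker trips (assumed)
MAX_FAILURES = 3   # consecutive timeouts before tripping (assumed)

failures = 0
open_until = 0.0

def send_log(entry):
    global failures, open_until
    if time.time() < open_until:
        return False  # breaker open: drop or buffer the entry locally
    try:
        requests.post("http://localhost:9200/logs/_doc",
                      json=entry, timeout=2)
        failures = 0
        return True
    except requests.Timeout:
        failures += 1
        if failures >= MAX_FAILURES:
            open_until = time.time() + COOL_DOWN  # trip the breaker
            failures = 0
        return False
```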
Another option is to set up a reverse proxy in front of Elasticsearch. Moreover, some problems related to heavy indexing traffic can be solved by throttling IO for merge operations in Elasticsearch itself, as discussed here.

Try using LucidWorks SiLK instead - it uses Solr, which is more scalable. Download it from here: http://www.lucidworks.com/lucidworks-silk

Related

Measuring performance of CouchDB replication

I have several PCs on the network with an application that uses CouchDB. CouchDB is configured to replicate data with the CouchDB instances on all the other nodes. I would like to measure the performance of data replication between the nodes. I tried to find out whether CouchDB exposes any data about the time spent replicating, but I didn't find anything helpful except the _scheduler/jobs endpoint, which shows how many documents are waiting to be replicated and a sequence number.
My current idea is a very naive script that queries each CouchDB instance's _scheduler/jobs endpoint frequently; based on the numbers returned in the changes_pending and docs_written fields, I can get some rough estimate of how long it takes to replicate data. This is, however, inaccurate and will take some time to set up. Maybe you know some easier ways/tools that can help me?
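Roughly what I have in mind; the host, credentials, and polling interval are placeholders, and only the _scheduler/jobs fields mentioned above are used:

```python
# Poll _scheduler/jobs and derive a docs-written-per-second figure
# per replication job from the deltas between samples.
import time
import requests

COUCH = "http://admin:password@localhost:5984"  # placeholder instance
INTERVAL = 10  # seconds between samples (placeholder)

def snapshot():
    jobs = requests.get(f"{COUCH}/_scheduler/jobs").json()["jobs"]
    # Some jobs may not have populated info yet; skip those
    return {j["id"]: j["info"] for j in jobs if j.get("info")}

prev = snapshot()
while True:
    time.sleep(INTERVAL)
    cur = snapshot()
    for job_id, info in cur.items():
        if job_id in prev:
            rate = (info["docs_written"] - prev[job_id]["docs_written"]) / INTERVAL
            print(f"{job_id}: ~{rate:.1f} docs/s, "
                  f"{info['changes_pending']} changes pending")
    prev = cur
```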
Questions:
Is there any way to fetch information about how long the replication of documents takes in CouchDB?
Also, maybe you know some tools that can help with measuring the performance of CouchDB replication?
There was a performance improvement for document replication in the 3.3.0 release. The corresponding pull request also documents how the replication performance was tested.
If you want to test replication performance yourself, I suggest you follow their setup using couchdyno.

What is a good indexing strategy when dealing with AWS/CloudWatch logs?

I am new to the Elasticsearch world and I'm working on a project to use the Amazon Elasticsearch service (Elasticsearch and Kibana) to provide a log analytics system for all the CloudWatch logs from different AWS accounts. Setting up the stack and routing the CloudWatch logs is the easy part. But I've noticed that a good indexing strategy comes into play, especially when you have immutable data in a time-series fashion (logs in this case).
My first approach was to create one daily index for each log group and use an index policy to move/expire old indices based on my requirements. But I figured I would end up with a lot of tiny indices in my Elasticsearch cluster.
Then I considered indexing all the CloudWatch log groups from each AWS account into a single daily index. The problem is that this exceeds the mapping limit (1000 fields), mostly because of CloudTrail and VPC flow logs, and I don't think it is a good idea to increase this limit.
So I've decided to group my logs into a limited number of index types (e.g. CloudTrail logs, VPC flow logs, and other logs). Basically, I would then have three daily indices for each AWS account, which are relatively larger, and I won't have to increase the mapping limit.
I'm sharing this to see if anybody else has implemented something similar and what their thoughts are. I'm still in the initial phase of the project and am eagerly looking for suggestions and recommendations.
A good indexing strategy is very subjective and depends on a lot of factors, like the size of each index and how often you are going to query it.
Since we are talking about CloudWatch logs here, you should keep your focus on avoiding lots of small indices. Apart from combining logs of different types, you can also look at combining older indices into weekly or monthly ones, for example by reindexing one week's data into a weekly index at the end of the week (a sketch of this is below). Also, make sure you have a retention period defined and are clearing out older indices.
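A sketch of that weekly roll-up using the `_reindex` API; the index names and week pattern here are invented for illustration:

```python
# Roll one week's daily indices into a single weekly index.
import requests

ES = "https://my-domain.us-east-1.es.amazonaws.com"  # assumed endpoint

body = {
    # All of that week's daily indices (pattern is illustrative)
    "source": {"index": "cloudtrail-2021.03.0*"},
    # One consolidated weekly index
    "dest": {"index": "cloudtrail-2021-w09"},
}
# Run asynchronously; the response contains a task id you can poll
resp = requests.post(f"{ES}/_reindex",
                     params={"wait_for_completion": "false"},
                     json=body)
print(resp.json())
# Once the weekly index is verified, delete the daily indices to
# reclaim their shard overhead.
```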
You can also consider looking at UltraWarm nodes in Amazon Elasticsearch, which provide a hot-warm storage architecture that works really well for read-only data like logs.

Is Elasticsearch a good data store for a read-only API?

We are planning to create a reporting database exposed via a read-only API. It'll contain reporting-related read APIs for both our customers and internal processes like invoicing.
We also thought it would be useful to put Kibana on top of it to provide analytics for our internal teams.
Is Elasticsearch good for this use case?
Yes, Elasticsearch would be a very good choice for your use case, for the following reasons:
You can denormalize your data and store it in a single index, which makes fetching and searching very fast. This is one of the prime use cases of NoSQL, and ES works well in that role.
Basic X-Pack security is available for free in ES, which would give your users read-only access without much effort or cost (see the sketch after this list).
Apart from search, Elasticsearch is also very popular for analytics use cases: you can easily run complex aggregations and use Kibana dashboards for visualisation, which integrate very nicely with ES since both are products of the same company (Elastic).
And most importantly, ES is a horizontally scalable, distributed system that can easily be scaled to hundreds of nodes to support anyone's growing needs.
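On point 2, a rough sketch of the read-only setup via the security API (assuming a recent ES version where these endpoints live under `_security`); the role, user, and index pattern names are invented:

```python
# Create a role that can only read the reporting indices,
# then assign it to a dedicated user.
import requests

ES = "http://localhost:9200"
AUTH = ("elastic", "changeme")  # assumed admin credentials

requests.put(f"{ES}/_security/role/reporting_reader", auth=AUTH, json={
    "indices": [{"names": ["reporting-*"], "privileges": ["read"]}]
})
requests.put(f"{ES}/_security/user/report_client", auth=AUTH, json={
    "password": "s3cret-example",
    "roles": ["reporting_reader"],
})
```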
In addition to opster's answer, there are two things I want to mention that might help you make a decision:
How E.S is serving us for a real-time reporting use case in production with an extensive data set
Performance of reporting in E.S vs Mongo (that we measured)
How E.S is serving us for a real-time reporting use case in production with an extensive data set
E.S provides real-time results (under 1 sec) for cases like ours:
Reports generated by running multiple sets of filters (date, etc.) and aggregations on millions of data points
Time-based reports (grouping data by day, week, month, quarter, year) - powered by DateHistogram (a sketch of such a query is below)
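A sketch of the kind of query behind those time-based reports; the index and field names are placeholders:

```python
# Group a year's documents by month and sum a numeric field.
import requests

ES = "http://localhost:9200"

query = {
    "size": 0,  # only the aggregation buckets, no raw hits
    "query": {"range": {"created_at": {"gte": "now-1y"}}},
    "aggs": {
        "per_month": {
            # Recent ES uses calendar_interval; older versions use interval
            "date_histogram": {"field": "created_at",
                               "calendar_interval": "month"},
            "aggs": {"total_amount": {"sum": {"field": "amount"}}},
        }
    },
}
resp = requests.post(f"{ES}/reports/_search", json=query).json()
for bucket in resp["aggregations"]["per_month"]["buckets"]:
    print(bucket["key_as_string"], bucket["total_amount"]["value"])
```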
Performance of reporting in E.S vs Mongo (that we measured)
Aggregating 5 million data points in E.S took < 1 sec, while it took Mongo > 10 sec on similar instances.
In addition to the above: support for scripting is also available, which provides a lot of flexibility.

elasticsearch - limit number of search requests

When Elasticsearch handles more than one search request at a time, it is sometimes much slower than running the searches one by one. Therefore I would like Elasticsearch to process search requests one at a time. How can I do this?
This indicates a problem with your data model, queries, or cluster configuration. It is not normal or expected for Elasticsearch to be much slower with two concurrent queries than when executing those two queries in sequence. You really, really should investigate the underlying problem (start by looking at your logs, if you haven't already). However, to answer the question: you can accomplish this by setting the search thread pool size to 1 (and perhaps increasing the queue_size to compensate), as sketched below.
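For reference, a sketch of the relevant settings; these are static node settings, so they go in `elasticsearch.yml` on each node and take effect after a restart (the queue size shown is just an example value):

```
# elasticsearch.yml (per node, restart required)
thread_pool.search.size: 1
thread_pool.search.queue_size: 1000
```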
I want to stress, though, that it's really not a good idea to mess with these settings except in advanced use cases (for example, when your usage is very imbalanced between index and search requests).

elasticsearch vs hbase/hadoop for realtime statistics

I'm logging millions of small log documents weekly, in order to:
run ad hoc queries for data mining
join, compare, filter, and calculate values
run many, many full-text searches with Python
run these operations over all the millions of docs, sometimes every day
My first thought was to put all the docs in HBase/HDFS and run Hadoop jobs to generate the stats results.
The problem is: some of the results must be near real-time.
So, after some research, I discovered Elasticsearch, and now I'm thinking about transferring all those millions of documents and using DSL queries to generate the stats results (the kind of query sketched below).
Is this a good idea? Elasticsearch seems easy to work with, even with millions/billions of documents.
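For example, the kind of DSL stats query I have in mind, combining full-text search with an aggregation (all index and field names are placeholders):

```python
# Full-text filter plus a terms aggregation over the matching log docs.
import requests

ES = "http://localhost:9200"

query = {
    "size": 0,
    "query": {"match": {"message": "timeout"}},  # full-text search
    "aggs": {
        "by_service": {"terms": {"field": "service.keyword", "size": 20}}
    },
}
resp = requests.post(f"{ES}/logs/_search", json=query).json()
for bucket in resp["aggregations"]["by_service"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```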
For real-time search analytics, Elasticsearch is a good choice.
It is definitely easier to set up and handle than Hadoop/HBase/HDFS.
A good Elasticsearch vs HBase comparison: http://db-engines.com/en/system/Elasticsearch%3BHBase
