Best approach to ingest real-time data (tweets) into Elasticsearch

Basically, my application has two types of traffic:
Real-time tweet ingestion (a delay of up to 1 minute is acceptable)
Tweet searches from multiple users
I have two questions:
What is the best approach to ingest this data into Elasticsearch?
What happens if I write tweets one at a time to the Elasticsearch index in real time? Does it affect the parallel search requests?

Indexing and search are the two main operations in Elasticsearch, and each has its own dedicated thread pool that works on these requests.
Coming to your questions:
1. What is the best approach to ingest this data into Elasticsearch?
You should not send these requests one by one; instead, use the bulk API, which is the recommended and more performant approach for such use cases. Also note that what matters in a bulk request is the total payload size, not the number of operations. The DZone blog post on this is a useful read.
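For illustration, here is a minimal sketch of buffering tweets and sending them in one bulk request with the Python client (the elasticsearch package, the local cluster URL, and the tweets index name and fields are all assumptions):

```python
from elasticsearch import Elasticsearch, helpers

# Assumed local cluster; replace with your own connection settings.
es = Elasticsearch("http://localhost:9200")

def tweet_actions(tweets):
    # Wrap each tweet in a bulk action; _index/_id are bulk metadata,
    # _source is the document body that gets indexed.
    for tweet in tweets:
        yield {
            "_index": "tweets",  # hypothetical index name
            "_id": tweet["id"],
            "_source": {"user": tweet["user"], "text": tweet["text"]},
        }

# Buffer incoming tweets for a short window (well within the 1-minute
# delay budget) and flush them as a single bulk request.
buffered = [
    {"id": "1", "user": "alice", "text": "hello"},
    {"id": "2", "user": "bob", "text": "world"},
]
ok, errors = helpers.bulk(es, tweet_actions(buffered))
print(f"indexed={ok}, errors={errors}")
```

helpers.bulk splits the actions into chunks for you; since the total request size matters more than the operation count, keeping each bulk payload around a few megabytes is a reasonable starting point to tune from.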
2. What happens if I write tweets one at a time to the Elasticsearch index in real time? Does it affect the parallel search requests?
As mentioned, indexing and search have their own thread pools; if one of them is saturated you will see issues in the corresponding operation, but there are various ways to tune both indexing and search.
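If you want to check whether one of those pools is becoming the bottleneck, the _cat/thread_pool API reports active, queued, and rejected tasks per pool. A small sketch with the Python client (cluster address assumed):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Inspect the write and search pools; a growing "rejected" count means
# that pool is saturated and requests are being dropped.
pools = es.cat.thread_pool(
    thread_pool_patterns="write,search",
    format="json",
    h="node_name,name,active,queue,rejected",
)
for pool in pools:
    print(pool)
```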

Related

How does Elasticsearch handle parallel index refresh requests?

In our project, we are hitting Elasticsearch's index refresh API after each create/update/delete operation for immediate search availability.
I want to know how Elasticsearch will perform if multiple parallel requests are made to its refresh API on a single index holding close to 2.5 million documents.
Any thoughts or suggestions?
Refresh is the operation by which Elasticsearch asks the Lucene shard to write recent changes into a new segment so that they become searchable.
If you ask for a refresh after every operation, you will create a huge number of micro-segments.
Too many segments make your searches slower, as the shard needs to go through all of them sequentially in order to return a result. They also consume hardware resources.
Each segment consumes file handles, memory, and CPU cycles. More important, every search request has to check every segment in turn; the more segments there are, the slower the search will be.
— from Elasticsearch: The Definitive Guide
Lucene will automatically merge those segments into bigger ones, but that is also an I/O-consuming task.
You can check this for more details.
To my knowledge, though, a refresh on a 2.5 million document index will take roughly the same time as on a 2.5k document index.
Also, it seems (from this issue) that refresh is a non-blocking operation.
Still, it is a bad pattern for an Elasticsearch cluster. Does every CUD operation in your application really need an immediate refresh?
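If the real requirement is read-your-writes for specific operations rather than forcing a cluster-wide refresh after every call, the refresh=wait_for option on the write request itself is usually a gentler pattern. A minimal sketch with the Python client (index name and document are hypothetical; the document= keyword assumes the 8.x client, older clients use body=):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Instead of calling POST /my-index/_refresh after every write, ask this
# single write to wait until the next scheduled refresh makes it visible.
es.index(
    index="my-index",               # hypothetical index name
    id="doc-1",
    document={"title": "example"},
    refresh="wait_for",
)
# The document is now searchable without forcing an extra micro-segment
# for every create/update/delete in the application.
```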

Elasticsearch single indexing performance

Is there any difference between indexing data into Elasticsearch in batches and one document at a time?
I want to use single-document indexing, but I don't know how it performs.
The bulk API should be used when ingesting large amounts of data.
There is a significant overhead in terms of resource utilization and performance when using the single-document index API (instead of bulk) to index a large number of docs.
Indexing one document at a time means one request per document, which takes far longer for a large volume of logs. If the payload is very high, Elasticsearch performance degrades drastically, resulting in intermittent delays before the data shows up on the Kibana dashboard.
So, depending on the volume of logs pushed to an index, single-document indexing should be avoided.
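To see the overhead for yourself, here is a rough sketch that times per-document indexing against a single bulk request over the same documents (index names and document shape are made up; document= assumes the 8.x Python client; absolute numbers depend entirely on your cluster):

```python
import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local cluster
docs = [{"message": f"log line {i}"} for i in range(1000)]

# One HTTP round trip per document.
start = time.perf_counter()
for i, doc in enumerate(docs):
    es.index(index="logs-single", id=str(i), document=doc)
single_s = time.perf_counter() - start

# One bulk request carrying all documents.
start = time.perf_counter()
helpers.bulk(
    es,
    ({"_index": "logs-bulk", "_id": str(i), "_source": d}
     for i, d in enumerate(docs)),
)
bulk_s = time.perf_counter() - start

print(f"single: {single_s:.2f}s  bulk: {bulk_s:.2f}s")
```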

Elasticsearch runtime metrics

My question is more research-related.
We have Elasticsearch handling various tasks, including taking log entries from remote clients. The problem is that the clients sometimes overload Elasticsearch.
Is there a way to query ES for runtime metrics such as the number of queries in the last n minutes? I'm hoping we can use these to throttle the client logging as load increases.
Data on the number of search and get requests per second can be obtained by querying the indices stats API.
There are multiple tools that provide Elasticsearch monitoring, most of them open source. Having a look at their source code may be helpful.
Please also note that throttling requests client-side based on Elasticsearch stats may not be the optimal solution, as it is hard to coordinate across a variable number of clients. Using circuit breakers that trigger on request timeouts may be more robust.
Another option is to set up a reverse proxy in front of Elasticsearch. Moreover, some problems caused by heavy indexing traffic can be addressed by throttling I/O for merge operations in Elasticsearch itself, as discussed here.
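A rough sketch of sampling the indices stats API to estimate the current search rate (the polling interval and the idea of a threshold are arbitrary assumptions for illustration):

```python
import time
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def search_query_total():
    # Cumulative count of search queries across all indices.
    stats = es.indices.stats(metric="search")
    return stats["_all"]["total"]["search"]["query_total"]

INTERVAL_S = 60  # arbitrary sampling window
before = search_query_total()
time.sleep(INTERVAL_S)
after = search_query_total()

rate = (after - before) / INTERVAL_S
print(f"~{rate:.1f} search queries/s over the last {INTERVAL_S}s")
# Clients could back off their logging when this rate (or the equivalent
# indexing rate) crosses a chosen threshold.
```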
Try using LucidWorks SiLK instead - it uses Solr and that's more scalable. Download it from here: http://www.lucidworks.com/lucidworks-silk

Elasticsearch vs HBase/Hadoop for real-time statistics

I'm logging millions of small log documents weekly in order to:
run ad hoc queries for data mining
join, compare, filter, and calculate values
run many full-text searches with Python
run these operations over all the millions of docs, sometimes every day
My first thought was to put all the docs in HBase/HDFS and run Hadoop jobs to generate the stats.
The problem is: some of the results must be near real-time.
So, after some research, I discovered Elasticsearch, and now I'm thinking about transferring all those millions of documents and using DSL queries to generate the stats.
Is this a good idea? Elasticsearch seems to make handling millions/billions of documents easy.
For real-time search and analytics, Elasticsearch is a good choice.
It is definitely easier to set up and operate than Hadoop/HBase/HDFS.
A good Elasticsearch vs HBase comparison: http://db-engines.com/en/system/Elasticsearch%3BHBase
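For the near-real-time stats part, most of those results map onto the aggregations DSL rather than Hadoop jobs. A hedged sketch counting documents per hour (the logs-* pattern and @timestamp field are assumptions; passing aggs= at the top level assumes the 8.x Python client, older versions use body=):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

resp = es.search(
    index="logs-*",      # hypothetical index pattern
    size=0,              # aggregation results only, no hits
    aggs={
        "per_hour": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "1h"}
        }
    },
)
for bucket in resp["aggregations"]["per_hour"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])
```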

Elasticsearch performance: continuous read/write vs bulk write

I am new to Elasticsearch. I need to implement a system that receives a data feed continuously throughout the day. I would like to make this feed searchable, so I am using Elasticsearch.
Now, I have two ways to go about this:
1) Store data from the feed in MongoDB and push it to Elasticsearch at a regular interval, say twice a day.
2) Feed data directly into Elasticsearch as a continuous process, while Elasticsearch also serves search queries.
I am expecting a volume of around 20 entries per second coming from the data feed and around 2-3 queries per second being handled by Elasticsearch.
Please advise.
Can you tell us more about your cluster architecture? How many nodes? Do all nodes hold data, or are there also gateway nodes?
Usually I would say that feeding directly into Elasticsearch shouldn't be a problem. 2-3 queries per second is not much at all for Elasticsearch.
You should optimize your index structure and application code for it:
Create a separate index for each day
Increase the number of shards (you should experiment, based on your hardware configuration)
Close old daily indices, or aggregate them into larger periods (e.g. monthly indices) using some batch processing
From my tests, 20 inserts per second is not a big load for Elasticsearch.
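A minimal sketch of the per-day index layout and the housekeeping step for older indices (index prefix, retention window, and the example write are assumptions; document= assumes the 8.x Python client):

```python
from datetime import datetime, timedelta, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def daily_index(prefix="feed"):
    # e.g. feed-2024.05.17 -- one index per day keeps each index small.
    return f"{prefix}-{datetime.now(timezone.utc):%Y.%m.%d}"

# Continuous feed: every entry goes into today's index.
es.index(index=daily_index(), document={"message": "example entry"})

# Periodic housekeeping: close indices that fall out of the hot window so
# they stop consuming heap while staying on disk for later reopening.
HOT_DAYS = 7  # arbitrary retention for open indices
cutoff = datetime.now(timezone.utc) - timedelta(days=HOT_DAYS)
es.indices.close(index=f"feed-{cutoff:%Y.%m.%d}", ignore_unavailable=True)
```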
