Elasticsearch single indexing performance

Is there any difference between indexing data into Elasticsearch in batches and indexing it one document at a time?
I want to use single-document indexing, but I don't know what its performance is like.

The bulk API should be used when ingesting large amounts of data.
There is significant overhead, in terms of resource utilization and performance, in using the single-document index API (instead of bulk) to index a large number of documents.

Indexing documents one at a time means one request per log entry, which quickly adds up when a large volume of logs is pushed to an index. If the load is very high, Elasticsearch performance degrades drastically, which can show up as data intermittently failing to load on the Kibana dashboard.
So depending on the volume of logs pushed to an index, you should avoid indexing them one at a time.
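For illustration, here is a minimal Python sketch using the official elasticsearch client that contrasts the two approaches. The cluster URL, the "logs" index, and the documents are assumptions made up for the example, and depending on the client version the keyword may be body rather than document.

```python
from elasticsearch import Elasticsearch, helpers

# Assumed local cluster and a hypothetical "logs" index.
es = Elasticsearch("http://localhost:9200")

docs = [{"message": f"log line {i}", "level": "INFO"} for i in range(10_000)]

# Single-document indexing: one HTTP round-trip per document.
# (Older client versions take body=doc instead of document=doc.)
for doc in docs:
    es.index(index="logs", document=doc)

# Bulk indexing: the same documents sent in batches of 1000.
actions = ({"_index": "logs", "_source": doc} for doc in docs)
helpers.bulk(es, actions, chunk_size=1000)
```

The bulk helper pays the per-request overhead (connection handling, routing, response parsing) once per batch instead of once per document, which is where the difference comes from.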

Related

Elasticsearch: best approach to ingest real-time data (tweets)

Basically, my application has two types of traffic:
1. Real-time tweet ingestion (a delay of up to 1 minute is acceptable)
2. Tweet searches from multiple users
I have two questions:
1. What is the best approach to ingest this data into Elasticsearch?
2. What happens if I write tweets one at a time to the Elasticsearch index in real time? Does it affect the parallel search requests?
Indexing and searching are the two main operations in Elasticsearch, and each has its own dedicated thread pool that works on these requests.
Coming to your questions:
1. What is the best approach to ingest this data into Elasticsearch?
You should not send these requests one by one; instead, use the bulk API to ingest the data, which is the recommended and more performant approach for such use cases. Also, what matters for a bulk request is its total payload size, not the number of operations it contains. The DZone blog post on this is a useful read.
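As a rough sketch of what batching by payload size rather than by operation count can look like in Python, where the 5 MB threshold, the "tweets" index name, and the tweet dictionaries are assumptions rather than values from the answer:

```python
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
MAX_BATCH_BYTES = 5 * 1024 * 1024  # assumed ~5 MB flush threshold

def index_tweets(tweets):
    """Accumulate bulk actions and flush whenever the payload gets large."""
    batch, batch_bytes = [], 0
    for tweet in tweets:
        batch.append({"_index": "tweets", "_source": tweet})
        batch_bytes += len(json.dumps(tweet).encode("utf-8"))
        if batch_bytes >= MAX_BATCH_BYTES:
            helpers.bulk(es, batch)
            batch, batch_bytes = [], 0
    if batch:
        helpers.bulk(es, batch)  # flush the remainder
```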
2. What happens if I write tweets one at a time to the Elasticsearch index in real time? Does it affect the parallel search requests?
As mentioned, indexing and search have their own thread pools; if one of them is saturated you will see issues in the corresponding operation, but there are various ways to tune your indexing and search operations.
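If you want to see whether the write or search pool is backing up, the cat thread pool API exposes active, queue, and rejected counts per node. A minimal sketch with the Python client, where the column names follow the standard _cat/thread_pool output:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One row per node per pool; rejections mean that pool is saturated.
print(es.cat.thread_pool(
    thread_pool_patterns="write,search",
    v=True,
    h="node_name,name,active,queue,rejected",
))
```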

How does Elasticsearch handle parallel index refresh requests?

In our project, we hit Elasticsearch's index refresh API after each create/update/delete operation for immediate search availability.
I want to know how Elasticsearch will perform if multiple parallel requests are made to the refresh API on a single index holding close to 2.5 million documents.
Any thoughts or suggestions?
Refresh is an operation where Elasticsearch asks the Lucene shard to write recent changes into a new segment so that they become searchable.
If you ask for a refresh after every operation, you will create a huge number of micro-segments.
Too many segments make searches slower, because the shard has to search through each of them sequentially in order to return a result. They also consume hardware resources.
Each segment consumes file handles, memory, and CPU cycles. More important, every search request has to check every segment in turn; the more segments there are, the slower the search will be.
(from the Definitive Guide)
Lucene will merge those segments automatically into bigger segments, but that's also an I/O consuming task.
You can check this for more details
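To get a feel for how many segments a shard has accumulated, the cat segments API lists one row per segment. A small sketch, assuming an index called my_index:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# A very long list here after frequent refreshes is a sign you are creating
# micro-segments faster than merging can consolidate them.
print(es.cat.segments(index="my_index", v=True, h="index,shard,segment,docs.count,size"))
```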
But from my knowledge, a refresh on an index with 2.5 billion documents will take roughly the same time as on an index with 2.5k documents.
Also, it seems (from this issue) that refresh is a non-blocking operation.
But it's a bad pattern for an Elasticsearch cluster. Does every CUD operation in your application really need a refresh?
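Two commonly used alternatives to calling the refresh API after every CUD operation are asking the write itself to wait for the next refresh (refresh=wait_for) and lengthening index.refresh_interval. A minimal Python sketch, where the index name, the document, and the 30s interval are assumptions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Option A: make the write wait until a refresh has made it searchable,
# instead of forcing an extra refresh per request.
# (Older client versions take body=... instead of document=...)
es.index(index="my_index", document={"title": "hello"}, refresh="wait_for")

# Option B: refresh less often by lengthening the refresh interval.
# (Newer client versions take settings=... instead of body=...)
es.indices.put_settings(index="my_index", body={"index": {"refresh_interval": "30s"}})
```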

Efficient way to search and sort data with elasticsearch as a datastore

We are using Elasticsearch as a primary data store, and our indexing strategy is time-based (for example, we create an index every 6 hours; this is configurable). The search-and-sort queries that come to our application contain a time range, and based on the input time range we calculate which indices need to be used for the search.
Now, if the input time range is large, say 6 months, and we delegate the search-and-sort query to Elasticsearch, then Elasticsearch will load all of those documents into memory, which could drastically increase the heap usage (we have a limit on the heap size).
One way to deal with this is to fetch the data index by index and sort it in our application; indices are opened and closed accordingly, for example only the latest 4 indices are kept open at all times and the remaining indices are opened and closed on demand. I'm wondering if there is a better way to handle the problem at hand.
UPDATE
Option 1
Instead of opening and closing indexes you could experiment with limiting the field data cache size.
You could limit the field data cache to a percentage of the JVM heap size or a specific size, for example 10Gb. Once field data is loaded into the cache it is not removed unless you specifically limit the cache size. Putting a limit will evict the oldest data in the cache and so avoid an OutOfMemoryException.
You might not get great performance but then it might not be worse than opening and closing indexes and would remove a lot of complexity.
Take into account that Elasticsearch loads the field values for every document in the index when it performs a sort, so whatever limit you set should be big enough to hold that index's field data in memory.
See limiting field data cache size
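The limit itself (indices.fielddata.cache.size) is a node-level setting in elasticsearch.yml, so it cannot be changed through the client, but you can watch how much heap fielddata actually occupies per node. A small sketch using the nodes stats API; the response path shown follows the standard node stats output:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

stats = es.nodes.stats(metric="indices")
for node_id, node in stats["nodes"].items():
    fielddata = node["indices"]["fielddata"]
    # memory_size_in_bytes grows as sort/aggregation fields are loaded;
    # evictions > 0 means the configured limit is being hit.
    print(node.get("name"), fielddata["memory_size_in_bytes"], "bytes,",
          fielddata["evictions"], "evictions")
```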
Option 2
Doc Values
Doc values mean writing the necessary metadata to disk at index time, so the "fielddata" required for sorting lives on disk rather than in memory. It is not hugely slower than in-memory fielddata, and it can in fact alleviate garbage-collection problems because less data is loaded into memory. There are some limitations, such as string fields needing to be not_analyzed.
You could use a mixed approach: enable doc values on your older indexes and use the faster, more flexible fielddata on current indexes (if you can classify your indexes that way). That way you don't penalize the queries on "active" data.
See Doc Values documentation
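As a hedged sketch of what enabling doc values looked like on the Elasticsearch versions this answer targets (1.x/2.x-style mappings with types and not_analyzed strings); the index, type, and field names are invented, and recent versions enable doc values by default:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Old-style mapping: sort fields backed by on-disk doc values, and the
# string field kept not_analyzed so it can carry doc values at all.
es.indices.create(
    index="events-2015-01",
    body={
        "mappings": {
            "event": {
                "properties": {
                    "timestamp": {"type": "date", "doc_values": True},
                    "user": {"type": "string", "index": "not_analyzed", "doc_values": True},
                }
            }
        }
    },
)
```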

ElasticSearch Performance : Continuous read/write vs Bulk write

I am new to Elastic Search. I need to implement a system where I will be getting data feed continuously throughout the day. I would like to make this data feed searchable so I am using ElasticSearch.
Now, I have two ways to go about this:
1) Store data from the feed in Mongo, and push this data to Elasticsearch at regular intervals, say twice a day.
2) Feed data directly to Elasticsearch as a continuous process, while Elasticsearch simultaneously serves search queries.
I am expecting a volume of around 20 entries per second coming from the data feed, and around 2-3 search queries per second hitting Elasticsearch.
Please advise.
Can you tell us more about your cluster architecture? How many nodes? Do all nodes hold data, or are there also gateway nodes?
Usually I would say feeding directly into Elasticsearch shouldn't be a problem; 2-3 queries per second is not much at all for Elasticsearch.
You should optimize your index structure and application code for it:
- Create a separate index for each day.
- Increase the number of shards (you should experiment, based on your hardware configuration).
- Close old daily indexes or aggregate them into larger periods (for example, monthly indexes) using some batch processing.
From my tests, 20 inserts per second is not a big load for Elasticsearch.
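A minimal sketch of the "index per day, feed directly" approach in Python; the feed- index name pattern, the chunk size, and the shape of the feed entries are assumptions:

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def actions(feed):
    """Route each feed entry to an index named after the current UTC day."""
    for entry in feed:
        index = "feed-" + datetime.now(timezone.utc).strftime("%Y.%m.%d")
        yield {"_index": index, "_source": entry}

def run(feed):
    # At ~20 entries/second, a small chunk keeps indexing latency low while
    # still amortizing per-request overhead across many documents.
    for ok, info in helpers.streaming_bulk(es, actions(feed), chunk_size=100):
        if not ok:
            print("failed:", info)
```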

Should I keep the size of stored fields in Solr to a minimum?

I am looking to introduce Solr to power the search for a business listing website. The site has around 2 million records.
There is a search results page which will display some key data for each result. I believe the data needed for this summary information is around 1KB per result.
I could simply index the fields needed for the search within Solr - but this means a separate database call for each result to populate the summary information. If Solr could return all of this data I would expect it to yield greater performance than ~40 database round-trips.
The concern is that Solr's memory usage would be too large (how might I calculate this?) and that indexing might take too long with the extra data.
You would benefit greatly from storing those fields in Solr compared to making ~40 DB round-trips. Just make sure that you mark the field as not indexed (indexed="false") in your schema config, and perhaps also as compressed (compressed="true"), although this will of course use some CPU when indexing and retrieving.
When a field is marked as not indexed, no analyzers process it at index time, so it is stored much faster than an indexed field.
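For illustration, a sketch with the pysolr client of returning the summary data straight from Solr instead of issuing one database call per result; the core name and field names are assumptions:

```python
import pysolr

# Assumed core named "listings" with stored (but not indexed) summary fields.
solr = pysolr.Solr("http://localhost:8983/solr/listings", timeout=10)

results = solr.search(
    "name:plumber AND city:london",
    fl="id,name,address,phone,summary",  # stored fields returned with each hit
    rows=40,
)
for doc in results:
    print(doc["id"], doc.get("summary"))
```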
It's a trade off, and you will have to analyze this yourself.
Solr's performance greatly depends on caching, not only of queries, but also of the documents themselves. Those caches depend on memory, and the bigger your documents are, the less you can fit in a fixed amount of memory.
Document size also affects index size and replication times. For large indices with master slave configurations, this can impact the rate at which you can update the index.
Ideally you should measure cache hit rates at different cache sizes, with and without the fields. If you can spend the memory to get a high enough cache hit rate with the fields, then by all means go for it. If you cannot, you may have to fetch the document content from another system.
There is a third alternative you didn't mention, which is to store the documents outside of the DB, but not in Solr. They should be stored in a format which is as close as possible to what you deliver with search results. The code which creates/updates the indices could create/update these documents as well. This is a lot of work, but like everything it comes down to how much performance you need and what you are willing to do to get it.
EDIT: For measuring cache hit rates and throughput, I've found the best test source is your current query logs. Take a day or two's worth of live queries and run them against different indexes and configurations to see how well they work.
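One way to run that measurement is to replay the logged queries against each candidate configuration and compare latencies; a small sketch, assuming a plain-text file with one logged query string per line and the same hypothetical core as above:

```python
import time
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/listings", timeout=10)

# Replay previously logged queries and record wall-clock latency for each.
with open("queries.log") as f:
    queries = [line.strip() for line in f if line.strip()]

latencies = []
for q in queries:
    start = time.perf_counter()
    solr.search(q, rows=20)
    latencies.append(time.perf_counter() - start)

latencies.sort()
print("median:", latencies[len(latencies) // 2])
print("p95:", latencies[int(len(latencies) * 0.95)])
```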

Resources