What does Elasticsearch automatic slicing do? - elasticsearch

What does Elasticsearch automatic slicing do? I find the documentation to be very laconic about this function. I tried searching for other explanations of this functionality, but to no avail. Neither I have managed to find what slice is in Elasticsearch.

Automatic slicing is a way to parallelize work for a few different endpoints, such as reindex, update by query and delete by query.
The three above APIs all work the same way by making a scroll query over the target index. Scroll queries provide a more performant way of making queries yielding big result sets than normal paged queries. Scroll queries can be further improved by slicing them.
In clear, if a query is supposed to return a big amount of hits, you can make a normal query and page through results using from/size, but that will not be performant because of deep-paging. To circumvent that issue, ES allows you to use scroll queries in order to get results in batches of N hits. Those scroll queries can further be improved by slicing them, i.e. split the scroll in multiple slices which can be consumed independently by your client application.
So, say you have a query which is supposed to return 1,000,000 hits, and you want to scroll over that result set in batches of 50,000 hits, using a normal scroll query (i.e. without slicing), your client application will have to make the first scroll call and then 20 more synchronous calls (i.e. one after another) to retrieve each batch of 50K hits.
By using slicing, you can parallelize the 20 scroll calls. If your client application is multi-threaded, you can make each scroll call use 5 (e.g.) slices, and thus, you'll end up with 5 slices of ~10K hits that can be consumed by 5 different threads in your application, instead of having a single thread consume 50K hits. You can thus leverage the full computing power of your client application to consume those hits.
The ideal number of slices should be a multiple of the number of shards in the source index. For the best performance, you should pick the same number of slices as there are shards in your source index. For that reason, you might want to use automatic slicing instead of manual slicing, as ES will pick that number for you.

Related

CosmosDB - Gremlin - high memory usagage with query containing limit() step

I want to retrive a large amount of items but using limit clause:
g.V().hasLabel('foo').as('f').limit(5000).order().by('f_Id',incr).by('f_bar',incr).select('f').unfold().dedup()
This query takes very long time and consumes about 800 MB memory to download the collection
Whan i use below query:
g.V().hasLabel('foo').as('f').has('propA','ValueA').has('propB','ABC').limit(5000).order().by('f_Id',incr).by('f_bar',incr).select('f').unfold().dedup()
it is faster and consumes less memory around 500 MB to download this collection but still high.
My Qestion is how to optimize the first query with just limit if i do not want to filter by Properties A and B.
Second Question why there is such difference in memory size between those two results? In both queries i download 5000 items to memory. What could be possible way to reduce this consumption.
I use GremlinDriver for .Net.
I'm not expert at CosmosDB optimization but from a Gremlin perspective when I look at this traversal:
g.V().hasLabel('foo').as('f').
limit(5000).order().by('f_Id',incr).by('f_bar',incr).
select('f').unfold().dedup()
I wonder why you wouldn't just write it as:
g.V().hasLabel('foo').limit(5000).order().by('f_Id',incr).by('f_bar',incr)
Meaning, you want 5000 "foo" vertices ordered a certain way. The need to use the "f" step label and unfold() seem unnecessary and I don't see how you could end up with duplicates so you can drop dedup(). I'm not sure if those changes will make any difference to how CosmosDB processes things but it certainly removes some unneeded processing.
I'd also wonder if you need to pair down the data returned in your vertices. Right now you're returning all the properties for each vertex. If you don't need all of those it might be better to be more specific and transform the data to the form your application requires:
g.V().hasLabel('foo').limit(5000).order().by('f_Id',incr).by('f_bar',incr).
valueMap('name','age')
That should help reduce serialization costs.

ElasticSearch 6.0.1: .../_forcemerge API does not scale with more threads (is throttled?)

I am using ES 6.0.1 and have to do frequent index open/append/close patterns over many indexes, typically from different clients in parallel. (Yes, I have to open and close every time)
This leads to high number of small Lucene segments per index, and the mentioned sequence becomes slower over time (2-5 times slower sometimes). Default ES segment merging strategy apparently is not doing a great job.
When I use Force Merge API to merge segments within index, the performance of my sequence returns back to normal for processed indices. However, due to large number of indices, I have to apply it many times to handle all indices. Naturally, I run it in multiple threads (connections), but it seems like ES never parallels this operation and the resulting rate of merge is the same, no matter how many parallel requests I make.
I've read and tried things from here but this didn't help.
Could someone kindly suggest any w/a for that?
You can change size of force_merge thread pool via elasticsearch config file, for example:
thread_pool.force_merge.size: 5
And don't forget to restart Elasticsearch after the config change.

Increase Solr performance when querying a subset of documents

The Usecase
I have an index of potentially millions of documents. I want to make around 20'0000 searches on a subset of these documents (around 25'000 documents). These 25'000 documents could take up around 100 MB stored in Solr (consisting of stored and indexes text fields).
The Problem
As the number of indexed documents increases, the performance of the queries decreases a lot. For example running 20'000 searches that hit 25'000 documents on 100'000 document index takes around 4 minutes. Running the same searches on 200'000 document index takes around 20 minutes.
So is there any way to cache these 25'000 documents in RAM before hitting them with searches?
UPDATE
Some things that really helped:
reducing returned row count (In almost all cases I had to iterate through returned results and in almost all cases where were no more than 100 matching results, but I had set rows to a very large value. Reducing the row count improved the performance around 2x. This seemed counter intuitive. If there are only 79 matches and I set returned row count to 100 it performs better than in a case when where are 79 matches and I set the row count to 1000. In the first case Solr already returns found item count and does it fast. Why should there be a performance difference?)
reducing multithreading (I had added multiple threads for querying because on the development box there were more resources available. On the resource constrained production box it was slowing things down. Using only one or two threads got me around 2x speed improvement.)
Some things that did not really help:
splitting up field queries (I was already using field queries everywhere it was possible, but I was combining them in one fq for each query fq=name:a AND type:b. Splitting them up with fq=name:a&fq=type:b caches them separately (see Apache Solr documentation) and could improve performance. But it did not make a huge difference in this case.
changing caching settings in this case filterCache seemed to have the most potential. However, increasing it or changing its settings did not make a huge difference.
A few things that are recommended for performance:
Have enough spare RAM on the box so index files can be in OS cache
Try to play around with solr caching settings in SolrConfig
Play around with autowarming after commits
Try to develop your queries to limit the result set. Large result sets, specifically if using grouping and faceting will kill performance. Now 200,000 document index is really quite small, so you should not have any problems, but I thought I'd mention this for when you scale.
Try to use Filter query (FQ) whenever possible. They are much faster than doing field:val in q, plus they are cached.

Performance issues using Elasticsearch as a time window storage

We are using elastic search almost as a cache, storing documents found in a time window. We continuously insert a lot of documents of different sizes and then we search in the ES using text queries combined with a date filter so the current thread does not get documents it has already seen. Something like this:
"((word1 AND word 2) OR (word3 AND word4)) AND insertedDate > 1389000"
We maintain the data in the elastic search for 30 minutes, using the TTL feature. Today we have at least 3 machines inserting new documents in bulk requests every minute for each machine and searching using queries like the one above pratically continuously.
We are having a lot of trouble indexing and retrieving these documents, we are not getting a good throughput volume of documents being indexed and returned by ES. We can't get even 200 documents indexed per second.
We believe the problem lies in the simultaneous queries, inserts and TTL deletes. We don't need to keep old data in elastic, we just need a small time window of documents indexed in elastic at a given time.
What should we do to improve our performance?
Thanks in advance
Machine type:
An Amazon EC2 medium instance (3.7 GB of RAM)
Additional information:
The code used to build the index is something like this:
https://gist.github.com/dggc/6523411
Our elasticsearch.json configuration file:
https://gist.github.com/dggc/6523421
EDIT
Sorry about the long delay to give you guys some feedback. Things were kind of hectic here at our company, and I chose to wait for calmer times to give a more detailed account of how we solved our issue. We still have to do some benchmarks to measure the actual improvements, but the point is that we solved the issue :)
First of all, I believe the indexing performance issues were caused by a usage error on out part. As I told before, we used Elasticsearch as a sort of a cache, to look for documents inside a 30 minutes time window. We looked for documents in elasticsearch whose content matched some query, and whose insert date was within some range. Elastic would then return us the full document json (which had a whole lot of data, besides the indexed content). Our configuration had elastic indexing the document json field by mistake (besides the content and insertDate fields), which we believe was the main cause of the indexing performance issues.
However, we also did a number of modifications, as suggested by the answers here, which we believe also improved the performance:
We now do not use the TTL feature, and instead use two "rolling indexes" under a common alias. When an index gets old, we create a new one, assign the alias to it, and delete the old one.
Our application does a huge number of queries per second. We believe this hits elastic hard, and degrades the indexing performance (since we only use one node for elastic search). We were using 10 shards for the node, which caused each query we fired to elastic to be translated into 10 queries, one for each shard. Since we can discard the data in elastic at any moment (thus making changes in the number of shards not a problem to us), we just changed the number of shards to 1, greatly reducing the number of queries in our elastic node.
We had 9 mappings in our index, and each query would be fired to a specific mapping. Of those 9 mappings, about 90% of the documents inserted went to two of those mappings. We created a separate rolling index for each of those mappings, and left the other 7 in the same index.
Not really a modification, but we installed SPM (Scalable Performance Monitoring) from Sematext, which allowed us to closely monitor elastic search and learn important metrics, such as the number of queries fired -> sematext.com/spm/index.html
Our usage numbers are relatively small. We have about 100 documents/second arriving which have to be indexed, with peaks of 400 documents/second. As for searches, we have about 1500 searches per minute (15000 before changing the number of shards). Before those modifications, we were hitting those performance issues, but not anymore.
TTL to time-series based indexes
You should consider using time-series-based indexes rather than the TTL feature. Given that you only care about the most recent 30 minute window of documents, create a new index for every 30 minutes using a date/time based naming convention: ie. docs-201309120000, docs-201309120030, docs-201309120100, docs-201309120130, etc. (Note the 30 minute increments in the naming convention.)
Using Elasticsearch's index aliasing feature (http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases/), you can alias docs to the most recently created index so that when you are bulk indexing, you always use the alias docs, but they'll get written to docs-201309120130, for example.
When querying, you would filter on a datetime field to ensure only the most recent 30 mins of documents are returned, and you'd need to query against the 2 most recently created indexes to ensure you get your full 30 minutes of documents - you could create another alias here to point to the two indexes, or just query against the two index names directly.
With this model, you don't have the overhead of TTL usage, and you can just delete the old, unused indexes from over an hour in the past.
There are other ways to improve bulk indexing and querying speed as well, but I think removal of TTL is going to be the biggest win - plus, your indexes only have a limited amount of data to filter/query against, which should provide a nice speed boost.
Elasticsearch settings (eg. memory, etc.)
Here are some setting that I commonly adjust for servers running ES - http://pastebin.com/mNUGQCLY, note that it's only for a 1GB VPS, so you'll need to adjust.
Node roles
Looking into master vs data vs 'client' ES node types might help you as well - http://www.elasticsearch.org/guide/reference/modules/node/
Indexing settings
When doing bulk inserts, consider modifying the values of both index.refresh_interval index.merge.policy.merge_factor - I see that you've modified refresh_interval to 5s, but consider setting it to -1 before the bulk indexing operation, and then back to your desired interval. Or, consider just doing a manual _refresh API hit after your bulk operation is done, particularly if you're only doing bulk inserts every minute - it's a controlled environment in that case.
With index.merge.policy.merge_factor, setting it to a higher value reduces the amount of segment merging ES does in the background, then back to its default after the bulk operation restores normal behaviour. A setting of 30 is commonly recommended for bulk inserts and the default value is 10.
Some other ways to improve Elasticsearch performance:
increase index refresh interval. Going from 1 second to 10 or 30 seconds can make a big difference in performance.
throttle merging if it's being overly aggressive. You can also reduce the number of concurrent merges by lowering index.merge.policy.max_merge_at_once and index.merge.policy.max_merge_at_once_explicit. Lowering the index.merge.scheduler.max_thread_count can help as well
It's good to see you are using SPM. Its URL in your EDIT was not hyperlink - it's at http://sematext.com/spm . "Indexing" graphs will show how changing of the merge-related settings affects performance.
I would fire up an additional ES instance and have it form a cluster with your current node. Then I would split the work between the two machines, use one for indexing and the other for querying. See how that works out for you. You might need to scale out even more for your specific usage patterns.

max number of couchbase views per bucket

How many views per bucket is too much, assuming a large amount of data in the bucket (>100GB, >100M documents, >12 document types), and assuming each view applies only to one document type? Or asked another way, at what point should some document types be split into separate buckets to save on the overhead of processing all views on all document types?
I am having a hard time deciding how to split my data into couchbase buckets, and the performance implications of the views required on the data. My data consists of more than a dozen relational DBs, with at least half with hundreds of millions of rows in a number of tables.
The http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-writing-bestpractice.html doc section "using document types" seems to imply having multiple document types in the same bucket is not ideal because views on specific document types are updated for all documents, even those that will never match the view. Indeed, it suggests separating data into buckets to avoid this overhead.
Yet there is a limit of 10 buckets per cluster for performance reasons. My only conclusion therefore is that each cluster can handle a maximum of 10 large collections of documents efficiently. Is this accurate?
Tug's advice was right on and allow me to add some perspective as well.
A bucket can be considered most closely related to (though not exactly) a "database instantiation" within the RDMS world. There will be multiple tables/schemas within that "database" and those can all be combined within a bucket.
Think about a bucket as a logical grouping of data that all shares some common configuration parameters (RAM quota, replica count, etc) and you should only need to split your data into multiple buckets when you need certain datasets to be controlled separately. Other reasons are related to very different workloads to different datasets or the desire to be able to track the workload to those datasets separately.
Some examples:
-I want to control the caching behavior for one set of data differently than another. For instance, many customers have a "session" bucket that they want always in RAM whereas they may have a larger, "user profile" bucket that doesn't need all the data cached in RAM. Technically these two data sets could reside in one bucket and allow Couchbase to be intelligent about which data to keep in RAM, but you don't have as much guarantee or control that the session data won't get pushed out...so putting it in its own bucket allows you to enforce that. It also gives you the added benefit of being able to monitor that traffic separately.
-I want some data to be replicated more times than others. While we generally recommend only one replica in most clusters, there are times when our users choose certain datasets that they want replicated an extra time. This can be controlled via separate buckets.
-Along the same lines, I only want some data to be replicated to another cluster/datacenter. This is also controlled per-bucket and so that data could be split to a separate bucket.
-When you have fairly extreme differences in workload (especially around the amount of writes) to a given dataset, it does begin to make sense from a view/index perspective to separate the data into a separate bucket. I mention this because it's true, but I also want to be clear that it is not the common case. You should use this approach after you identify a problem, not before because you think you might.
Regarding this last point, yes every write to a bucket will be picked up by the indexing engine but by using document types within the JSON, you can abort the processing for a given document very quickly and it really shouldn't have a detrimental impact to have lots of data coming in that doesn't apply to certain views. If you don't mind, I'm particularly curious at which parts of the documentation imply otherwise since that certainly wasn't our intention.
So in general, we see most deployments with a low number of buckets (2-3) and only a few upwards of 5. Our limit of 10 comes from some known CPU and disk IO overhead of our internal tracking of statistics (the load or lack thereof on a bucket doesn't matter here). We certainly plan to reduce this overhead with future releases, but that still wouldn't change our recommendation of only having a few buckets. The advantages of being able to combine multiple "schemas" into a single logical grouping and apply view/indexes across that still exist regardless.
We are in the process right now of coming up with much more specific guidelines and sizing recommendations (I wrote those first two blogs as a stop-gap until we do).
As an initial approach, you want to try and keep the number of design documents around 4 because by default we process up to 4 in parallel. You can increase this number, but that should be matched by increased CPU and disk IO capacity. You'll then want to keep the number of views within each document relatively low, probably well below 10, since they are each processed in serial.
I recently worked with one user who had an fairly large amount of views (around 8 design documents and some dd's with nearly 20 views) and we were able to drastically bring this down by combining multiple views into one. Obviously it's very application dependent, but you should try to generate multiple different "queries" off of one index. Using reductions, key-prefixing (within the views), and collation, all combined with different range and grouping queries can make a single index that may appear crowded at first, but is actually very flexible.
The less design documents and views you have, the less disk space, IO and CPU resources you will need. There's never going to be a magic bullet or hard-and-fast guideline number unfortunately. In the end, YMMV and testing on your own dataset is better than any multi-page response I can write ;-)
Hope that helps, please don't hesitate to reach out to us directly if you have specific questions about your specific use case that you don't want published.
Perry
As you can see from the Couchbase documentation, it is not really possible to provide a "universal" rules to give you an exact member.
But based on the best practice document that you have used and some discussion(here) you should be able to design your database/views properly.
Let's start with the last question:
YES the reason why Couchbase advice to have a small number of bucket is for performance - and more importantly resources consumption- reason. I am inviting you to read these blog posts that help to understand what's going on "inside" Couchbase:
Sizing 1: http://blog.couchbase.com/how-many-nodes-part-1-introduction-sizing-couchbase-server-20-cluster
Sizing 2: http://blog.couchbase.com/how-many-nodes-part-2-sizing-couchbase-server-20-cluster
Compaction: http://blog.couchbase.com/compaction-magic-couchbase-server-20
So you will see that most of the "operations" are done by bucket.
So let's now look at the original question:
yes most the time your will organize the design document/and views by type of document.
It is NOT a problem to have all the document "types" in a single(few) buckets, this is in fact the way your work with Couchbase
The most important part to look is, the size of your doc (to see how "long" will be the parsing of the JSON) and how often the document will be created/updated, and also deleted, since the JS code of the view is ONLY executed when you create/change the document.
So what you should do:
1 single bucket
how many design documents? (how many types do you have?)
how any views in each document you will have?
In fact the most expensive part is not during the indexing or quering it is more when you have to rebalance the data and indices between nodes (add, remove , failure of nodes)
Finally, but it looks like you already know it, this chapter is quite good to understand how views works (how the index is created and used):
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-operation.html
Do not hesitate to add more information if needed.

Resources