optimize elastic search performance - elasticsearch

I am trying to benchmark my Elasticsearch setup by posting documents against a large schema. The two variations of the schema are:
1. Indexing enabled for each attribute.
2. Indexing disabled for all the attributes.
My benchmark consists only of opening ElasticHQ for the cluster and watching for CPU spikes.
However, I don't see the CPU spikes dropping when using option 2.
Question: shouldn't disabling indexing result in better performance?
Setup:
Running Elasticsearch in Docker with 1 shard and 1 replica for the index.
Schema with index enabled: https://pastebin.com/uXFkCCzY
Schema with index disabled: https://pastebin.com/FGSAFTMT
Document:
{
  "status": "open",
  "created_at": "2022-02-14",
  "long_12": 123456789,
  "division": {
    "prop_1": 112211,
    "prop_2": false,
    "currency": "a brief text"
  },
  "emails": {
    "email": "abc#gmail.com"
  }
}
Load test scenario: 10 Java threads running on an i7 laptop, each posting 100,000 documents with a small modification (to keep the documents distinct, the status field value was randomly generated).
More detail on why I am doing this:
My production Elasticsearch (ES) cluster is performing very badly, with reads taking upwards of 10 seconds. Apart from all the necessary read optimizations I can do, I am also noticing that the ES cluster is generally very busy. I also noticed that my ES index schema doesn't have indexing disabled for any attribute (and we have around 350 attributes).
So my expectation was that if I disabled indexing for unnecessary attributes, I would get some wins. However, that's not happening.
Can you please shed some light on:
1. Should setting index: false and enabled: false have improved performance?
2. Am I disabling the index on attributes the right way?
3. Is my benchmarking technique right?
NOTE: The document and schema are for reference purposes only; the actual schema and document in PROD are quite large. The result was consistent when benchmarked using a large document.
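For the second question, here is a minimal sketch of what disabling indexing per attribute can look like in a mapping. The field types are assumptions chosen to match the sample document (the real schemas are only linked), and `disable_indexing` is a hypothetical helper, not an Elasticsearch API. It sets `index: false` on leaf fields and leaves an object field with `enabled: false` untouched, since `enabled` applies to object fields while `index` applies to leaf fields.

```python
import copy


def disable_indexing(properties, keep):
    """Return a copy of a mapping's `properties` with `index: false` set on
    every leaf field whose dotted path is not listed in `keep`."""
    def walk(props, prefix):
        out = {}
        for name, spec in props.items():
            path = f"{prefix}{name}"
            spec = copy.deepcopy(spec)
            if "properties" in spec:
                # Object with sub-fields: recurse into its children.
                spec["properties"] = walk(spec["properties"], path + ".")
            elif spec.get("type") != "object" and path not in keep:
                # Leaf field that is never searched on: do not index it.
                spec["index"] = False
            out[name] = spec
        return out
    return walk(properties, "")


# Assumed mapping for the sample document (types are guesses).
properties = {
    "status": {"type": "keyword"},
    "created_at": {"type": "date"},
    "long_12": {"type": "long"},
    "division": {"properties": {
        "prop_1": {"type": "long"},
        "prop_2": {"type": "boolean"},
        "currency": {"type": "text"},
    }},
    # Whole object is stored in _source but neither parsed nor indexed.
    "emails": {"type": "object", "enabled": False},
}

slim = disable_indexing(properties, keep={"status", "created_at"})
```

Fields with `index: false` are still kept in `_source`, so every document is still parsed and stored on write, which may be part of why the saving looks smaller than expected.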

Related

A Range Search Query is causing Garbage Collection in Elastic Search

I have an Elastic Search 5.2 cluster with 16 nodes (13 data nodes/3 master/24 GB RAM/12 GB Heap). I am performance testing a query and making 50 calls of a search query per second on the Elastic cluster. My query looks like the following -
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "cust_id": "AC-90-464690064500"
          }
        },
        {
          "range": {
            "yy_mo_no": {
              "gt": 201701,
              "lte": 201710
            }
          }
        }
      ]
    }
  }
}
My index mapping looks like the following:
cust_id       Keyword
smry_amt      Long
yy_mo_no      Integer   // doc_values enabled
mkt_id        Keyword
. . .
. . .
currency_cd   Keyword   // 10 fields in total, 8 of them Keyword type
The index contains 200 million records and for each cust_id, there may be 100s of records. Index has 2 Replicas. The record size is under 100 bytes.
When I run the performance test for 10 minutes, query responses are very slow. Looking in more detail at the Kibana monitoring tab, it appears that a lot of garbage collection activity is happening (see the monitoring chart).
I have several questions on my mind. I did some research on range queries but didn't find much on what can cause GC activity in scenarios similar to mine. I also researched memory usage and GC activity, but most of the Elastic documentation says that young-generation GC is normal while indexing, whereas search activity mostly uses the file system cache that the OS maintains. That would explain why, in the chart, the heap is not heavily used: search was using the file system cache.
So:
What might be causing the garbage collection to happen here?
The chart shows that heap is still available to Elasticsearch, and used heap is still very low compared to what is available. What, then, is triggering GC?
Is the query type causing some internal data structure to be created and then disposed of, causing GC?
The CPU spike may be due to GC activity.
Is there a more efficient way of running the range query in Elasticsearch versions before 5.5?
Profiling the query shows that Elasticsearch is running a TermQuery and a BooleanQuery, with the latter costing the most.
Any idea what's going on here?
Thanks in advance,
SGSI.
The correct answer depends on the index settings, but I guess you are using the integer type with doc values enabled. This data structure is meant to support aggregations and sorting, not range queries. The right data type is range.
With doc values, Elasticsearch/Lucene iterates over ALL documents (i.e. a full scan) to match a range query. This requires reading and decoding every value from the doc-values column, which is quite expensive, especially when the index cannot be cached by the operating system.
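To make the cost difference in that explanation concrete, here is a toy model in Python. It is only an illustration of "scan every value" versus "look up in a value-sorted structure"; the function names are hypothetical and neither function reflects Lucene's actual on-disk formats.

```python
from bisect import bisect_right


def range_scan(column, gt, lte):
    """Full scan: inspect the value of every document (doc-values style)."""
    return [doc for doc, value in enumerate(column) if gt < value <= lte]


def build_sorted_index(column):
    """Toy value-sorted index: a list of (value, doc_id) pairs."""
    return sorted((value, doc) for doc, value in enumerate(column))


def range_lookup(index, gt, lte):
    """Binary-search the sorted index and read only the matching entries."""
    lo = bisect_right(index, (gt, float("inf")))   # first entry with value > gt
    hi = bisect_right(index, (lte, float("inf")))  # one past last value <= lte
    return sorted(doc for _, doc in index[lo:hi])


# yy_mo_no values for six documents (doc id = position in the list).
column = [201612, 201701, 201705, 201710, 201711, 201703]
index = build_sorted_index(column)

scanned = range_scan(column, gt=201701, lte=201710)
looked_up = range_lookup(index, gt=201701, lte=201710)
```

Both calls return the same document ids, but the scan touches all six values while the lookup touches only the three that match; with 200 million records that gap is what the answer is pointing at.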

Something "Materialized view"-like in ElasticSearch

I have a query which runs every time a website is loaded. This query aggregates over three different term fields and around 3 million documents, and therefore needs 6-7 seconds to complete. The data does not change that frequently, and the freshness of the result is not critical.
I know that I can use an alias to create something "view"-like, as in the RDBMS world. Is it also possible to populate it, so the query result gets cached? Is there any other way caching might help in this scenario, or do I have to create an additional index for the aggregated data and update it from time to time?
I know the post is old, but regarding views: Elastic added data frames in 7.3.0.
You could also use the _reindex API:
POST /_reindex
{
  "source": {
    "index": "live_index"
  },
  "dest": {
    "index": "caching_index"
  }
}
But it will not change your ingestion problem.
For that, I think the solution is sharding your index.
With 2 or more shards and several nodes, Elasticsearch will be able to parallelize.
An easier thing to test, though, is to disable refresh (refresh_interval) while indexing and to re-enable it afterwards. It generally improves ingestion time a lot.
You can find a full article on this use case at
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html
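The refresh suggestion can be sketched as an ordered list of requests. This only builds the request bodies and does not talk to a cluster; `bulk_load_plan` is a hypothetical helper, the index name is taken from the reindex example above, and the restored value of `1s` assumes the index was using the default refresh interval.

```python
def refresh_settings(interval):
    """Body for PUT /<index>/_settings that changes refresh_interval."""
    return {"index": {"refresh_interval": interval}}


def bulk_load_plan(index, restore="1s"):
    """Ordered (method, path, body) requests around a bulk load."""
    return [
        # 1. Turn periodic refresh off before loading ("-1" disables it).
        ("PUT", f"/{index}/_settings", refresh_settings("-1")),
        # 2. Placeholder for the actual bulk/reindex requests.
        ("POST", f"/{index}/_bulk", None),
        # 3. Restore the previous interval (assumed to be the default 1s).
        ("PUT", f"/{index}/_settings", refresh_settings(restore)),
        # 4. Refresh once so the loaded documents become searchable.
        ("POST", f"/{index}/_refresh", None),
    ]


plan = bulk_load_plan("caching_index")
```

Each tuple can be sent with whatever HTTP client is already in use; the point is only the ordering around the load.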
You create a materialised view. It is ultimately a table that holds the results of aggregate functions. Since the aggregated data has already been inserted, querying it is faster, and I feel there is no need for caching as well. I have created MVs myself, and they improve performance tremendously. Having said that, you can also go with Elasticsearch, where aggregation queries can be cached if your data does not change frequently. I feel MVs and Elasticsearch give the same performance.

Elastic search preference set to custom value, document still returned from different shards

I'm having an issue with scoring: when I run the same query multiple times, the documents are not scored the same way each time. I found out that this is a well-known problem, the bouncing results issue.
A bit of context: I have multiple shards across multiple nodes (60 shards, 10 data nodes), all the nodes are running ES 2.3, and we're heavily using nested documents (the example query doesn't use them, for simplicity).
I tried to resolve it by using the preference search parameter, with a custom value. The documentation states:
A custom value will be used to guarantee that the same shards will be used for the same custom value. This can help with "jumping values" when hitting different shards in different refresh states. A sample value can be something like the web session id, or the user name.
However, when I run this query multiple times:
GET myindex/_search?preference=asfd
{
  "query": {
    "term": {
      "has_account": {
        "value": "twitter"
      }
    }
  }
}
I end up having the same documents, but with different scoring/sorting. If I enable explain, I can see that those documents are coming from different shards.
If I use preference=_primary or preference=_replica, we have the expected behavior (always the same shard, always the same scoring/sorting) but I can't query only one or the other...
I also experimented with search_type=dfs_search_then_fetch, which should generate the scoring based on the whole index, across all shards, but I still get different scoring for each run of the query.
So in short, how do I ensure the score and the sorting of the results of a query stay the same during a user's session?
Looks like my replicas went out of sync with the primaries.
No idea why, but deleting the replicas and recreating them has "fixed" the problem... I'll need to investigate why they went out of sync.
Edit 21/10/2016
Regarding the "preference" option not being taken into account: it's linked to AWS zone awareness. If the preferred replica is in a different zone than the client node, the preference is ignored.
The differences between the replicas are "normal" if you delete (or update) documents. From my understanding, the deleted document count will vary between replicas, since they're not necessarily merging segments at the same time.
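For the per-session stability the question asks about, the custom preference value is usually derived from something stable for the user, as the quoted documentation suggests. A minimal sketch, assuming a hypothetical `search_path` helper and a `session-` prefix of my own choosing; it only builds the request path and does not address the zone-awareness caveat above.

```python
from urllib.parse import quote, urlencode


def search_path(index, session_id):
    """Search path whose preference value is tied to the user's session."""
    # Preference values starting with "_" are reserved (e.g. _primary,
    # _replica), so a prefix keeps the session id a plain custom string.
    preference = f"session-{session_id}"
    return f"/{quote(index)}/_search?{urlencode({'preference': preference})}"


path = search_path("myindex", "asfd")
```

The same session id always produces the same path, so every query in that session carries the same preference value.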

Elasticsearch significant terms aggregation

I've started using the significant terms aggregation to see which keywords are important in groups of documents as compared to the entire set of documents I've indexed.
It all works great until a lot of documents are indexed. Then, for the same query that used to work, Elasticsearch only says:
SearchPhaseExecutionException[Failed to execute phase [query],
all shards failed; shardFailures {[OIWBSjVzT1uxfxwizhS5eg][demo_paragraphs][0]:
CircuitBreakingException[Data too large, data for field [text]
would be larger than limit of [633785548/604.4mb]];
My query looks like the following:
POST /demo_paragraphs/_search
{
  "query": {
    "match": {
      "django_target_id": 1915661
    }
  },
  "aggregations": {
    "signKeywords": {
      "significant_terms": {
        "field": "text"
      }
    }
  }
}
And the document structure:
"_source": {
  "django_ct": "citations.citation",
  "django_target_id": 1915661,
  "django_id": 3414077,
  "internal_citation_id": "CR7_151",
  "django_source_id": 1915654,
  "text": "Mucin 1 (MUC1) is a protein heterodimer that is overexpressed in lung cancers [6]. MUC1 consists of two subunits, an N-terminal extracellular subunit (MUC1-N) and a C-terminal transmembrane subunit (MUC1-C). Overexpression of MUC1 is sufficient for the induction of anchorage independent growth and tumorigenicity [7]. Other studies have shown that the MUC1-C cytoplasmic domain is responsible for the induction of the malignant phenotype and that MUC1-N is dispensable for transformation [8]. Overexpression of",
  "id": "citations.citation.3414077",
  "num_distinct_citations": 0
}
The data that I index are paragraphs from scientific papers. No document is really large.
Any ideas on how to analyze or solve the problem?
If the data set is too large to compute the result on one machine, you may need more than one node.
Be thoughtful when planning shard distribution. Make sure that shards are properly distributed so each node is equally stressed when computing heavy queries. A good topology for large data sets is a Master-Data-Search configuration, where one node acts as master (no data, no queries running on this node), a few nodes are dedicated to holding data (shards), and some nodes are dedicated to executing queries (they do not hold data; they use the data nodes for partial query execution and combine the results). For a start, Netflix uses this topology (see Netflix Raigad).
Paweł Róg is right: you will need much more RAM. For a start, increase the Java heap size available to each node. See this site for details: ElasticSearch configuration
You have to research how much RAM is enough. Sometimes too much RAM actually slows ES down (unless this was fixed in one of the recent versions).
I think there is a simple solution.
Please give ES more RAM :D Aggregations require a lot of memory.
Note that Elasticsearch 6.0 introduces the new significant_text aggregation, which doesn't require field data. See https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-bucket-significanttext-aggregation.html
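As a rough sketch of that last suggestion, the original query could be rewritten with `significant_text` nested under a `sampler` aggregation, which is how the aggregation is typically used since it re-analyzes text on the fly. The helper name, the `sample` bucket name and the `shard_size` of 100 are assumptions for illustration; the field names come from the question.

```python
def significant_text_query(target_id, field="text", shard_size=100):
    """Request body using significant_text (Elasticsearch 6.0+) on a sample."""
    return {
        "query": {"match": {"django_target_id": target_id}},
        "aggregations": {
            # Only the top-matching documents per shard are analyzed.
            "sample": {
                "sampler": {"shard_size": shard_size},
                "aggregations": {
                    "signKeywords": {"significant_text": {"field": field}}
                },
            }
        },
    }


body = significant_text_query(1915661)
```

The body would be sent to `POST /demo_paragraphs/_search` in place of the `significant_terms` request.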

elasticsearch - routing VS. indexing for query performance

I'm planning a strategy for querying millions of docs by date and by user.
Option 1 - indexing by user. routing by date.
Option 2 - indexing by date. routing by user.
What are the differences or advantages when using routing or indexing?
One of the design patterns that Shay Banon # Elasticsearch recommends is: index by time range, route by user and use aliasing.
Create an index for each day (or a date range) and route documents on user field, so you could 'retire' older logs and you don't need queries to execute on all shards:
$ curl -XPOST localhost:9200/user_logs_20140418 -d '{
  "mappings": {
    "user_log": {
      "_routing": {
        "required": true,
        "path": "user"
      },
      "properties": {
        "user": { "type": "string" },
        "log_time": { "type": "date" }
      }
    }
  }
}'
Create an alias to filter and route on users, so you could query for documents of user_foo:
$ curl -XPOST localhost:9200/_aliases -d '{
  "actions": [{
    "add": {
      "index": "user_logs_20140418",
      "alias": "user_foo",
      "filter": {"term": {"user": "foo"}},
      "routing": "foo"
    }
  }]
}'
Create aliases for time windows, so you could query for documents this_week:
$ curl -XPOST localhost:9200/_aliases -d '{
  "actions": [
    {
      "add": {
        "indices": ["user_logs_20140418", "user_logs_20140417", "user_logs_20140416", "user_logs_20140415", "user_logs_20140414"],
        "alias": "this_week"
      }
    },
    {
      "remove": {
        "indices": ["user_logs_20140413", "user_logs_20140412", "user_logs_20140411", "user_logs_20140410", "user_logs_20140409", "user_logs_20140408", "user_logs_20140407"],
        "alias": "this_week"
      }
    }
  ]
}'
Some of the advantages of this approach:
if you search using aliases for users, you hit only shards where the users' data resides
if a user's data grows, you could consider creating a separate index for that user (all you need is to point that user's alias to the new index)
no performance implications from over-allocating shards
you could 'retire' older logs by simply closing (when you close indices, they consume practically no resources) or deleting an entire index (deleting an index is simpler than deleting documents within an index)
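Maintaining the this_week alias by hand gets tedious, so the daily rotation is usually scripted. A minimal sketch that only builds the `_aliases` request body: the helper names are hypothetical, the `user_logs_YYYYMMDD` naming and the 5-day window and 7-day removal range mirror the example above, and each action sits in its own object inside `actions`.

```python
from datetime import date, timedelta


def index_name(day, prefix="user_logs_"):
    """Daily index name, e.g. user_logs_20140418."""
    return f"{prefix}{day:%Y%m%d}"


def this_week_actions(today, window=5, retire=7, alias="this_week"):
    """Body for POST /_aliases that rolls the alias forward to `today`."""
    add = [index_name(today - timedelta(days=i)) for i in range(window)]
    remove = [index_name(today - timedelta(days=i))
              for i in range(window, window + retire)]
    return {"actions": [
        {"add": {"indices": add, "alias": alias}},        # current window
        {"remove": {"indices": remove, "alias": alias}},  # older indices
    ]}


body = this_week_actions(date(2014, 4, 18))
```

Running this once a day from a scheduler keeps the application pointed at a single alias name while the underlying indices change.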
Indexing is the process of parsing (tokenizing and filtering) the document you submit and storing it in an inverted index. It's like the index at the back of a textbook.
When the indexed data exceeds the limits of one server, instead of upgrading that server you add another server and share the data between them. This process is called sharding.
When we search, the search runs on all shards, performs a map-reduce, and returns the results. If we group similar data together and search only the relevant group, it reduces the processing required and increases speed.
Routing is used to store a group of data in particular shards. When selecting a field for routing, the field should be present in all docs, and the field should not contain different values.
Note: routing is only useful in an environment with multiple shards (not on a single node). If we use routing on a single node, there is no benefit.
Let's define the terms first.
Indexing, in the context of Elasticsearch, can mean many things:
indexing a document: writing a new document to Elasticsearch
indexing a field: defining a field in the mapping (schema) as indexed. All fields that you search on need to be indexed (and all fields are indexed by default)
Elasticsearch index: this is a unit of configuration (e.g. the schema/mapping) and of data (i.e. some files on disk). It's like a database, in the sense that a document is written to an index. When you search, you can reach out to one or more indices
Lucene index: an Elasticsearch index can be divided into N shards. A shard is a Lucene index. When you index a document, that document gets routed to one of the shards. When you search in the index, the search is broadcast to a copy of each shard. Each shard replies with what it knows, then results are aggregated and sent back to the client
Judging by the context, "indexing by user" and "indexing by date" refer to having one index per user or one index per date interval (e.g. day).
Routing refers to sending documents to shards as I described earlier. By default, this is done quite randomly: a hash range is divided by the number of shards. When a document comes in, Elasticsearch hashes its _id. The hash falls into the hash range of one of the shards ==> that's where the document goes.
You can use custom routing to control this: instead of hashing the _id, Elasticsearch can hash a routing value (e.g. the user name). As a result, all documents with the same routing value (i.e. same user) land on the same shard. Routing can then be used at query time, so that Elasticsearch queries just one shard (per index) instead of N. This can bring massive query performance gains (check slide 24 in particular).
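The routing rule described here boils down to "hash the routing value, take it modulo the number of primary shards". A minimal sketch of that idea, using CRC32 purely as a stand-in hash (Elasticsearch uses its own hash function, so the shard numbers will differ from a real cluster) and a hypothetical `shard_for` helper:

```python
import zlib


def shard_for(routing_value, num_primary_shards):
    """Pick a shard from a routing value. CRC32 is a stand-in hash only."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


docs = [
    {"_id": "1", "user": "foo"},
    {"_id": "2", "user": "foo"},
    {"_id": "3", "user": "bar"},
]

# Default routing: hash the _id, so one user's documents spread over shards.
by_id = {d["_id"]: shard_for(d["_id"], 5) for d in docs}

# Custom routing: hash the user, so one user's documents share a shard.
by_user = {d["_id"]: shard_for(d["user"], 5) for d in docs}
```

With routing by user, a query for user foo only needs the one shard that `shard_for("foo", 5)` points to, which is where the query-time gain comes from.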
Back to the question at hand, I would take it as "what are the differences or advantages when breaking data down by index or using routing?"
To answer, the strategy should account for:
how indexing (writing) is done. If there's heavy indexing, you need to make sure all nodes participate (i.e. write similar amounts of data on the same number of shards), otherwise there will be bottlenecks
how data is queried. If queries often refer to a single user's data, it's useful to have data already broken down by user (index per user or routing by user)
total number of shards. The more shards, nodes and fields you have, the bigger the cluster state. If the cluster state size becomes large (e.g. larger than a few 10s of MB), it becomes harder to keep in sync on all nodes, leading to cluster instability. As a rule of thumb, you'll want to stay within a few 10s of thousands of shards in a single Elasticsearch cluster
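The shard-count consideration is simple arithmetic, sketched below. All numbers (one year of daily indices, 5 primaries, 1 replica, 1,000 users) are made-up inputs for illustration, and `total_shards` is a hypothetical helper.

```python
def total_shards(num_indices, primaries_per_index, replicas):
    """Total shard copies in the cluster: primaries plus their replicas."""
    return num_indices * primaries_per_index * (1 + replicas)


# One daily index kept for a year, 5 primaries, 1 replica each.
daily = total_shards(num_indices=365, primaries_per_index=5, replicas=1)

# One daily index series per user, 1,000 users, 1 primary, 1 replica each.
per_user = total_shards(num_indices=365 * 1000, primaries_per_index=1, replicas=1)
```

The first layout stays in the low thousands of shards, while the per-user layout lands in the hundreds of thousands, well past the rule of thumb given above.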
In practice, I've seen the following designs:
1. one index per fixed time interval. You'll see this with logs (e.g. Logstash writes to daily indices by default)
2. one index per time interval, rotated by size. This maintains constant index sizes even if write throughput varies
3. one index "series" (either 1. or 2.) per user. This works well if you have few users, because it eliminates filtering. But it won't work with many users because you'd have too many shards
4. one index per time interval (either 1. or 2.) with lots of shards and routing by user. This works well if you have many users. As Mahesh pointed out, it's problematic if some users have lots of data, leading to uneven shards. In this case, you need a way to reindex big users into their own indices (see 3.), and you can use aliases to hide this logic from the application.
I didn't see a design with one index per user and routing by date interval yet. The main disadvantage here is that you'll likely write to one shard at a time (the shard containing today's hash). This will limit your write throughput and your ability to balance writes. But maybe this design works well for a high-but-not-huge number of users (e.g. 1K), few writes and lots of queries for limited time intervals.
BTW, if you want to learn more about this stuff, we have an Elasticsearch Operations training, where we discuss a lot about architecture, trade-offs, how Elasticsearch works under the hood. (disclosure: I deliver this class)

Resources