We have a two-node cluster (each node a VM in a private cloud with 64GB of RAM and an 8-core CPU, running CentOS), a few small indices (~1 mil documents) and one big index with ~220 mil docs (2 shards, 170GB on disk). 24GB of memory is allocated to Elasticsearch on each box.
Document structure:
{
'article_id': {
'index': 'not_analyzed',
'store': 'yes',
'type': 'long'
},
'feed_id': {
'index': 'not_analyzed',
'store': 'yes',
'type': 'string'
},
'title': {
'index': 'analyzed',
'type': 'string'
},
'content': {
'index': 'analyzed',
'type': 'string'
},
'lang': {
'index': 'not_analyzed',
'type': 'string'
}
}
It takes about 1-2 seconds to run the following query:
{
"query" : {
"multi_match" : {
"query" : "some search term",
"fields" : [ "title", "content" ],
"type": "phrase_prefix"
}
},
"size": 20,
"fields" :["article_id", "feed_id"]
}
Are we hitting hardware limits at this point or are there ways to optimize the query or data structure to increase performance?
Thanks in advance!
It's possible you are hitting the limits of your hardware, but there are a few things you can do to your query first to help optimize it.
Max Expansions
The first thing I would do is limit max_expansions. The way prefix queries work is by generating a list of terms that start with the prefix; for phrase_prefix, the prefix is the last token in your query. In your search query "some search term", the last token "term" would be used as the prefix seed, and you may generate a list like this:
term
terms
terminate
terminator
termite
The prefix expansion process runs through your terms dictionary looking for any word that matches the seed prefix. By default, this list is unbounded, which means you can generate a very large list of expansions.
The second phase rewrites your original query into a series of term queries using those expansions. The bigger the expansion list, the more terms are evaluated against your index, with a corresponding decrease in speed.
If you limit the expansion process to something reasonable, you can maintain speed and still usually get good prefix matching:
{
"query" : {
"multi_match" : {
"query" : "some search term",
"fields" : [ "title", "content" ],
"type": "phrase_prefix",
"max_expansions" : 100
}
},
"size": 20,
"fields" :["article_id", "feed_id"],
}
You'll have to play with how many expansions you want. It is a tradeoff between speed and recall.
Filtering
In general, the other thing you can add is filtering. If there is some criterion you can filter on, you can potentially improve speed drastically. Currently, your query executes against the entire index (~220m documents), which is a lot to evaluate. If you can add a filter that cuts that number down, you can see much improved latency.
At the end of the day, the fewer documents the query has to evaluate, the faster it will run. Filters decrease the number of docs a query sees, are cached, and operate very quickly.
Your situation may not have any applicable filters, but if it does, they can really help!
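For example, if searches could be limited to a single language, the not_analyzed lang field from your mapping would make a good filter. A rough sketch using the 1.x-style filtered query (the "en" value is just an illustration; on newer versions the equivalent is a bool query with a filter clause):
{
  "query" : {
    "filtered" : {
      "query" : {
        "multi_match" : {
          "query" : "some search term",
          "fields" : [ "title", "content" ],
          "type": "phrase_prefix",
          "max_expansions" : 100
        }
      },
      "filter" : {
        "term" : { "lang" : "en" }
      }
    }
  },
  "size": 20,
  "fields" : ["article_id", "feed_id"]
}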
File System Caching
This advice is entirely dependent on the rest of the system. If you aren't fully utilizing your heap (24gb) because you are doing simple search and filtering (e.g. not faceting / geo / heavy sorts / scripts), you may be able to reallocate some of your heap to the file system cache.
For example, if your max heap usage peaks at 12gb, it may make sense to decrease the heap size to 15gb. The roughly 9gb you free up goes back to the OS and helps cache segments, which boosts search performance simply because more operations are diskless.
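As a sketch, assuming a 1.x/2.x-era install where the heap is set through the ES_HEAP_SIZE environment variable (newer versions use -Xms/-Xmx in jvm.options instead), the change is just:
# shrink the JVM heap so the remaining RAM is left to the OS file system cache
export ES_HEAP_SIZE=15g
# restart the node afterwards so the new heap size takes effect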
Related
I am performing a refactor of the code to query an ES index, and I was wondering if there is any difference between the two snippets below:
"bool" : {
"should" : [ {
"terms" : {
"myType" : [ 1 ]
}
}, {
"terms" : {
"myType" : [ 2 ]
}
}, {
"terms" : {
"myType" : [ 4 ]
}
} ]
}
and
"terms" : {
"myType" : [ 1, 2, 4 ]
}
Please check this post from the Elastic discuss page, which answers your question. Copying here for quick reference:
There's a few differences.
The simplest to see is the verbosity - terms queries just list an
array while term queries require more JSON.
terms queries do not score matches based on IDF (the rareness) of
matched terms - the term query does.
term queries can only have up to 1024 values due to Boolean's max
clause count
terms queries can have more terms
By default, Elasticsearch limits the terms query to a maximum of
65,536 terms. You can change this limit using the
index.max_terms_count setting.
Which of them is going to be faster? Is speed also related to the
number of terms?
It depends. They execute differently. term queries do more expensive scoring but do so lazily. They may "skip" over docs during execution because other, more selective criteria may advance the stream of matching docs being considered.
The terms query doesn't do expensive scoring but is more eager: it creates the equivalent of a single bitset, with a one or zero for every doc, by ORing all the potentially matching docs up front. Many terms can share the same bitset, which is what provides the scalability in the number of terms.
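As a side note, if you ever run into the 65,536-term ceiling mentioned above, the limit can be raised per index. A sketch (index name and value are placeholders; depending on your version the setting may also need to be applied at index creation time):
curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/my-index/_settings' -d'
{
  "index.max_terms_count": 100000
}
'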
I have an Elasticsearch index with a default index sort on price:
shop_prices_sort_index
"sort" : {
"field" : "enrich.price",
"order" : "desc"
},
If I insert 10 documents:
100, 98, 10230, 34, 1, 23, 777, 2323, 3, 109
and fetch results using /_search, by default the documents are returned in descending order of price:
10230, 2323...
But if I distribute my documents across 3 shards, the same query returns some other sequence of products:
100, 98, 34...
I am really stuck here; I am not sure if I am missing something basic or if I need extra settings to make a sorted index behave correctly.
PS: I also tried 'routing' and 'preference', but no luck.
Any help much appreciated.
When configuring index sorting, you're only making sure that each segment inside each shard is properly sorted. The goal of index sorting is to enable some additional optimizations during searches.
Due to the distributed nature of ES, when your index has many shards, each shard will be properly sorted, but your search query will still need to use sorting explicitly.
So if your index settings contain the following to apply sorting at indexing time
"sort" : {
"field" : "enrich.price",
"order" : "desc"
}
your search queries will also need to contain the same sort specification at query time
"sort" : {
"field" : "enrich.price",
"order" : "desc"
}
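In other words, a full request would look something like this (a sketch; match_all is just a placeholder for whatever query you actually run):
curl -H 'Content-Type: application/json' "http://localhost:9200/shop_prices_sort_index/_search" -d'
{
  "query" : { "match_all" : {} },
  "sort" : [
    { "enrich.price" : { "order" : "desc" } }
  ]
}
'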
By using index sorting you'll hit a little overhead at indexing time, but your search queries will be a bit faster in the end.
I have data stored in Elasticsearch. One of the fields is the logging level, whose values are defined in a Java enum.
The enum values are:
0 => undefined
1 => info
2 => low
3 => high
4 => fatal
EDIT:
This is what I am trying, but I keep getting a "Variable [level] is not defined" error.
curl -H 'Content-Type: application/json' "http://localhost:33206/_search" -d'
{
"sort" : {
"_script" : {
"type" : "number",
"script" : {
"lang": "painless",
"source": "params.mapping[doc['level'].value]",
"params" : {
"UNDEFINED": 0,
"INFO": 1,
"LOW": 2,
"HIGH": 3,
"FATAL": 4
}
},
"order" : "asc"
}
}
}
'
In Elasticsearch we are storing the strings rather than the numbers.
If I wanted to query Elasticsearch and have the results ordered by the corresponding numbers, how do I do that? Of course, sorting by the string will produce wrong results.
This is not possible without scripting, and scripting is not recommended from a performance perspective.
You should instead have a separate field that stores the integer value, and sort on that.
Reasons to avoid scripts:
If possible, avoid using scripts or scripted fields in searches.
Because scripts can’t make use of index structures, using scripts in
search queries can result in slower search speeds.
If you often use scripts to transform indexed data, you can speed up
search by making these changes during ingest instead. However, that
often means slower index speeds.
One more consideration is security: scripting has had loopholes that make it a potential vulnerability.
Reference
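A sketch of that approach (the index name logs and the field name level_code are hypothetical; the _doc endpoint assumes 6.x+, adjust the type path for older versions): store a numeric companion field alongside the string, then sort on it without any script:
curl -H 'Content-Type: application/json' -XPOST "http://localhost:33206/logs/_doc" -d'
{ "level": "HIGH", "level_code": 3 }
'
curl -H 'Content-Type: application/json' "http://localhost:33206/logs/_search" -d'
{
  "sort" : [ { "level_code" : "asc" } ]
}
'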
My script was correct, except for the single quotes around "level". Changing them to double quotes makes it work.
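For completeness, the working request then looks roughly like this (the inner double quotes are escaped so they survive inside the JSON string):
curl -H 'Content-Type: application/json' "http://localhost:33206/_search" -d'
{
  "sort" : {
    "_script" : {
      "type" : "number",
      "script" : {
        "lang": "painless",
        "source": "params.mapping[doc[\"level\"].value]",
        "params" : {
          "mapping": {
            "UNDEFINED": 0,
            "INFO": 1,
            "LOW": 2,
            "HIGH": 3,
            "FATAL": 4
          }
        }
      },
      "order" : "asc"
    }
  }
}
'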
I have a large number of documents (around 34,719,074) in one type of an index (ES 2.4.4). While searching, my ES cluster comes under heavy load (search latency, CPU usage, JVM memory and load average all spike) when the "from" parameter is high (greater than 100,000, with the "size" parameter held constant). Any specific reason for it? My query looks like:
{
"explain": false,
"size": 100,
"from": <>,
"_source": {
"excludes": [],
"includes": [
<around 850 fields>
]
},
"sort": [
<sorting on a string field>
]
}
This is a classic problem of deep pagination. You may want to read up on pagination in Elasticsearch. Essentially, getting the next set of documents after skipping 100,000 of them is a memory-intensive task: to produce a result set starting at offset 100,000+, 100,000+ documents need to be fetched from each shard and then processed (ranked, sorted, etc.). Ranking/sorting a smaller result set takes less time than doing it over a larger one.
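If you don't actually need random access to an arbitrary page, the scroll API avoids paying the deep-pagination cost on every request. A rough sketch for 2.4 (index name, scroll timeout and the placeholders are illustrative):
# open the scroll context, keeping it alive for one minute
curl -XGET 'http://localhost:9200/my-index/_search?scroll=1m' -d'
{
  "size": 100,
  "_source": { "includes": [ <around 850 fields> ] },
  "sort": [ <sorting on a string field> ]
}
'
# fetch the next batch with the _scroll_id returned by the previous call
curl -XGET 'http://localhost:9200/_search/scroll' -d'
{
  "scroll": "1m",
  "scroll_id": "<scroll_id from previous response>"
}
'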
The term filter that is used:
curl -XGET 'http://localhost:9200/my-index/my-doc-type/_search' -d '{
"filter": {
"term": {
"void": false
}
},
"fields": [
[
"user_id1",
"user_name",
"date",
"status",
"q1",
"q1_unique_code",
"q2",
"q3"
]
],
"size": 50000,
"sort": [
"date_value"
]
}'
The void field is a boolean field.
The index store size is 504mb.
The elasticsearch setup consists of only a single node and the index
consists of only a single shard and 0 replicas. The version of
elasticsearch is 0.90.7
The fields mentioned above are only the first 8; the actual query we execute requests 350 fields.
We noticed the memory spiking by about 2-3gb though the store size is only 504mb.
Running the query multiple times seems to continuously increase the memory.
Could someone explain why this memory spike occurs?
It's quite an old version of Elasticsearch
You're returning 50,000 records in one request
Sorting the 50k records
Your documents are pretty big - 350 fields.
Could you instead return a smaller number of records and then page through them?
Scan and Scroll could help you there (see the sketch below).
It's also not clear whether the individual fields are stored; if they aren't, every field you request is extracted from the _source read from disk, which may be incurring a memory overhead.
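A rough sketch of scan-and-scroll on the old 0.90 API (the batch size is arbitrary; with search_type=scan the size applies per shard and results come back unsorted, which also sidesteps the fielddata cost of sorting 50,000 hits in one request; any ordering would then have to be applied client-side):
curl -XGET 'http://localhost:9200/my-index/my-doc-type/_search?search_type=scan&scroll=1m' -d'
{
  "query": {
    "constant_score": {
      "filter": { "term": { "void": false } }
    }
  },
  "size": 500
}
'
# repeat until no more hits are returned, passing the _scroll_id from the previous response
curl -XGET 'http://localhost:9200/_search/scroll?scroll=1m' -d '<scroll_id from previous response>'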