I have a cluster of 10 nodes where I index about 100 million records daily, close to 6 billion records in total, and I am constantly loading data. Each record has about 75 fields associated with it. 99% of my queries filter on the same field, essentially select * from table where groupid = 'value', and the majority of queries bring back about a hundred records.
My queries currently take about 30 seconds the first two times they run and are then in the milliseconds. The problem is that every user searches for a different groupID, so their queries are going to be slow for the most part until they run them a third time.
Is it possible to "cache" the groupid field so that I can get sub-second queries?
My current query looks like this (pseudo-query). (I'm using a non-analyzed field, which I believe is better?)
{
  "query": {
    "filtered": {
      "filter": {
        "term": { "groupID": "valuex" }
      }
    }
  }
}
I"ve researched and not sure how to go about this. I've looked into doc_values = yes and possibly field cache?
I do not care about scoring or aggregates. My only use case is to filter records and bring back only the 100 or so out of 5 billion that have the correct groupID.
We have about 64 GB of memory on each server.
Just looking for help on how to achieve optimal performance/caching, or anything else that would help.
I thought about routing, but this would be difficult given our groupid values.
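For illustration only, custom routing would mean indexing and searching with the same routing value so that only one shard is involved (the index, type and values below are made up):
PUT myindex/mytype/1?routing=valuex
{ "groupID": "valuex" }
GET myindex/_search?routing=valuex
{
  "query": { "term": { "groupID": "valuex" } }
}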
thanks
Starting from Elasticsearch 2.0 we made some caching changes, like:
Keeps track of the 256 most recently used queries
Only caches those that appear 5 times or more
Does not cache on segments which have fewer than 10,000 documents or less than 3% of the documents of the index
Wondering if you are hitting this last one.
Note that we did that because the file system cache is probably better than internal caching.
By the way, could you try a bool query instead of a filtered query and see how it performs? filtered has been deprecated (and is removed in 5.0).
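For example, the filtered query from the first post could be rewritten as a bool query with a filter clause, which skips scoring entirely (the index name is a placeholder):
GET myindex/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "groupID": "valuex" } }
      ]
    }
  }
}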
Related
I have an Elasticsearch 5.2 cluster with 16 nodes (13 data nodes, 3 masters, 24 GB RAM, 12 GB heap). I am performance testing a query, making 50 search calls per second against the cluster. My query looks like the following:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "cust_id": "AC-90-464690064500"
          }
        },
        {
          "range": {
            "yy_mo_no": {
              "gt": 201701,
              "lte": 201710
            }
          }
        }
      ]
    }
  }
}
My index mapping is like the following:
cust_id      Keyword
smry_amt     Long
yy_mo_no     Integer   // doc_values enabled
mkt_id       Keyword
. . .
currency_cd  Keyword   // 10 fields in total, 8 of them Keyword
The index contains 200 million records, and for each cust_id there may be hundreds of records. The index has 2 replicas. The record size is under 100 bytes.
When I run the performance test for 10 minutes, the query responses seem very slow. Investigating a bit more in the Kibana monitoring tab, it appears that there is a lot of garbage collection activity happening (please see the image below):
I have several questions on my mind. I did some research on range queries but didn't find much on what can cause GC activity in scenarios similar to mine. I also researched memory usage and GC activity, but most of the Elastic documentation says that young-generation GC is normal while indexing, while search activity mostly uses the file system cache that the OS maintains. That's why, in the chart above, the heap is not heavily used, since search was hitting the file system cache.
So:
What might be causing the garbage collection to happen here?
The chart shows that plenty of heap is still available to Elasticsearch, and used heap is very low compared to what is available. So what is triggering GC?
Is the query type causing some internal data structure to be created and then disposed of, causing GC?
The CPU spike may be due to GC activity.
Is there a more efficient way of running the range query in Elasticsearch versions before 5.5?
Profiling the query shows that Elastic is running a TermQuery and a BooleanQuery, with the latter costing the most.
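For reference, this kind of profile output can be obtained by adding "profile": true to the search body; a minimal sketch against a placeholder index:
GET myindex/_search
{
  "profile": true,
  "query": {
    "bool": {
      "must": [
        { "term": { "cust_id": "AC-90-464690064500" } },
        { "range": { "yy_mo_no": { "gt": 201701, "lte": 201710 } } }
      ]
    }
  }
}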
Any idea what's going on here?
Thanks in Advance,
SGSI.
The correct answer depends on the index settings, but I guess you are using the integer type with doc_values enabled. That data structure is designed to support aggregations and sorting, not range queries. The right data type here is range.
In the case of doc values, Elastic/Lucene iterates over ALL documents (i.e. a full scan) in order to match a range query. This requires reading and decoding every value from the doc values column, which is quite expensive, especially when the index cannot be cached by the operating system.
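A rough sketch of that suggestion, assuming the integer_range field type available from Elasticsearch 5.2 onward (the index and type names are made up, and a single month value would be stored as a degenerate range with gte equal to lte):
PUT summary_v2
{
  "mappings": {
    "summary": {
      "properties": {
        "cust_id":  { "type": "keyword" },
        "yy_mo_no": { "type": "integer_range" }
      }
    }
  }
}
PUT summary_v2/summary/1
{
  "cust_id": "AC-90-464690064500",
  "yy_mo_no": { "gte": 201701, "lte": 201701 }
}
The idea is that a range query against a range field is answered from the field's own index structure (the relation defaults to intersects) instead of by scanning doc values.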
I am using Elasticsearch to perform some aggregations. Everything used to work fine, but currently I have 2 million docs in an index. I am performing a very simple search query to list all documents of a given type in a given index.
{
  "size": 100000,
  "query": {
    "match_all": {}
  }
}
This query is very slow and returns about 300k hits. What could be the possible reasons for that?
NOTE: I have 2 GB of RAM and 2 cores.
You are trying to get a response with 100,000 documents in it. This is just too much. Elasticsearch is intended for paging, which means fetching in small chunks. You are trying to fetch a bulk of 100,000. There is a reason why the size defaults to 10.
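A minimal sketch of paging instead of pulling everything at once (the page size of 100 is arbitrary):
{
  "from": 0,
  "size": 100,
  "query": { "match_all": {} }
}
Each subsequent page increases from by the page size (100, 200, ...); for very deep pagination the scroll API is the better tool.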
I finally found out that this configuration is enough for my needs, which is searching over 2 million documents. I had a wrong configuration, and simply doing a match_all is not the right approach anyway; even with 2 million docs, a search based on some actual criteria would be very fast.
I'm having an issue with scoring: when I run the same query multiple times, the documents are not scored the same way each time. I found out that the problem is well known; it's the bouncing results issue.
A bit of context: I have multiple shards across multiple nodes (60 shards, 10 data nodes), all the nodes are running ES 2.3, and we're heavily using nested documents. The example query doesn't use them, for simplicity.
I tried to resolve it by using the preference search parameter, with a custom value. The documentation states:
A custom value will be used to guarantee that the same shards will be used for the same custom value. This can help with "jumping values" when hitting different shards in different refresh states. A sample value can be something like the web session id, or the user name.
However, when I run this query multiple times:
GET myindex/_search?preference=asfd
{
  "query": {
    "term": {
      "has_account": {
        "value": "twitter"
      }
    }
  }
}
I end up having the same documents, but with different scoring/sorting. If I enable explain, I can see that those documents are coming from different shards.
If I use preference=_primary or preference=_replica, we have the expected behavior (always the same shard, always the same scoring/sorting) but I can't query only one or the other...
I also experimented with search_type=dfs_query_then_fetch, which should compute scores based on the whole index, across all shards, but I still get different scoring on each run of the query.
So in short, how do I ensure the score and the sorting of the results of a query stay the same during a user's session?
Looks like my replicas went out of sync with the primaries.
No idea why, but deleting the replicas and recreating them has "fixed" the problem... I'll need to investigate why they went out of sync.
Edit 21/10/2016
Regarding the "preference" option not being taken into account, it's linked to the AWS zone awareness: if the preferred replica is in another zone than the client node, then the preference will be ignored.
The differences between the replicas are "normal" if you delete (or update) documents, from my understanding the deleted document count will vary between the replicas, since they're not necessarily merging segments at the same time.
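For reference, a quick way to see the document counts of the primary and replica copies side by side (the index name is a placeholder); the prirep column distinguishes primaries (p) from replicas (r):
GET _cat/shards/myindex?v&h=index,shard,prirep,docs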
I have an Elasticsearch instance for indexing log records. Naturally the data grows over time, and I would like to limit its size (to about 10 GB), something like a MongoDB capped collection.
I'm not interested in old log records anyway.
I haven't found any config setting for this, and I'm not sure that I can just remove data files.
Any suggestions?
The Elasticsearch "way" of dealing with "old" data is to create time-based indices. Meaning, for each day or each week you create an index. Index everything belonging to that day/week in that index.
You decide how many days you want to keep around and stick to that number. Let's say that the data for 7 days adds up to 10 GB. On the 8th day you create the new index, as usual, and then you delete the index from 8 days before.
This way you'll always have 7 indices in your cluster.
Using ttl as the other poster suggested is not recommended, because it is far more difficult and it creates additional pressure on the cluster. The ttl mechanism checks for expired documents every indices.ttl.interval (60 seconds by default), builds bulk delete requests out of them and executes them. This means unnecessary requests hitting the cluster.
Instead, deleting an index is very easy and quick.
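For example, with daily indices the cleanup is a single call (the index name below is illustrative):
DELETE /logs-2016.10.01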
Take a look at this, and at how to easily manage time-based indices with Curator.
From what I remember, a capped collection in MongoDB was just a circular-buffer type of collection that removes the oldest entries when there's no more room? Unfortunately there's nothing like this out of the box in Elasticsearch; you have to add this functionality yourself by removing single documents (or batches of documents) using ES's API. A more performant way is described in their documentation under retiring data.
You can provide a per-index/type default _ttl (time to live) value as follows:
{
  "tweet" : {
    "_ttl" : { "enabled" : true, "default" : "1d" }
  }
}
You will find more detail here: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-ttl-field.html
Regards,
Alain
In order to load all the documents indexed by Elasticsearch, I am using the following query through Tire.
def all
  max = total
  Tire.search 'my_documents' do
    query { all }
    size max
  end.results.map { |entry| entry.to_hash }
end
Where max (i.e. total) is a count query that returns the number of documents present. I have indexed about 10,000 documents, and currently the request takes too long.
I am aware that I should not query all documents like this. What is the best alternative here? Pagination? If so, on which metric would I base the number of documents per page?
I am also planning to grow the number of documents to 100,000 or even 1,000,000, and I don't yet see how this can scale.
I appreciate every comment.
Rationale: I do this because I am running calculations over this data. Hence, I need to fetch all the data, run the computations, and save the results back into the documents.
Have a look at the scroll API, which is highly optimized for fetching a large number of results. It uses the scan search type and doesn't support sorting, but lets you provide a query to filter the documents you want to fetch. Have a look at the reference to learn more about it. Remember that the size you define in the request is per shard; that means that if you have 5 primary shards, setting 10 would lead to 50 results being returned per request.
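A minimal scan-and-scroll sketch against the index from the question, assuming an older cluster where the scan search type is still available (the scroll id placeholder comes from each previous response):
GET /my_documents/_search?search_type=scan&scroll=1m
{
  "size": 50,
  "query": { "match_all": {} }
}
GET /_search/scroll?scroll=1m&scroll_id=<scroll_id_from_previous_response>
With 5 primary shards and a size of 50, each scroll call returns up to 250 documents; keep calling the scroll endpoint until no more hits come back.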