elasticsearch in memory speed - performance

I'm trying to test how much faster the in-memory solution with Elasticsearch would be.
For this, I wrote a test that generates ~10 million records and then performs a text search. Results come back in 3-20 ms, but there is no difference (at all) between searching with the in-memory setting and without it. Is that possible? Are 10 million records too few to see any difference? I'm not even 100% sure I enabled the in-memory mode correctly. I'm loading the settings from a JSON file, into which I put some settings I found on the internet that were supposed to improve the overall solution, but it seems they are not working at all.
The index settings look like this:
"index": {
"store": {
"type":"memory"
},
"merge": {
"policy": {
"use_compound_file": false
}
},
"translog": {
"flush_threshold": 50000
},
"engine": {
"robin": {
"refresh_interval": 2
}
},
"cache": {
"field": {
"max_size": 500000,
"expire": "30m"
}
}
},
"indices": {
"memory": {
"index_buffer_size": 256
}
},

I don't know whether in-memory storage is the right choice for your use case or not; you can check which type of storage you need.
But you have to provide the storage setting while creating the index (make sure the index doesn't already exist).
Try this:
curl -XPUT "http://localhost:9200/my_index/" -d'
{
"settings": {
"index.store.type": "memory"
}
}'
This creates an index that stores its data in main memory, using Lucene's RamIndexStore.
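To confirm the setting actually took effect, you can read it back (a standard settings call, shown here only as a sanity check):
curl -XGET "http://localhost:9200/my_index/_settings?pretty"
If index.store.type does not come back as memory, the index existed before the setting was supplied, which would also explain why the benchmark shows no difference.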

Related

What is the best way to update cache in elasticsearch

I'm using an Elasticsearch index as a cache table.
My document structure is the following:
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "query_str": { "type": "text" },
      "search_results": {
        "type": "object",
        "enabled": false
      },
      "query_embedding": {
        "type": "dense_vector",
        "dims": 768
      }
    }
  }
}
The cache lookup is performed via embedding vector similarity: if the embedding of a new query is close enough to a cached one, it is considered a cache hit, and the search_results field is returned to the user.
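For reference, such a lookup can be expressed as a script_score query over the dense_vector field. A minimal sketch, where cache_index and the min_score threshold of 1.9 are illustrative (cosineSimilarity returns -1..1, so +1.0 shifts it to 0..2), and the 768-dimensional query vector is abbreviated:
GET cache_index/_search
{
  "size": 1,
  "min_score": 1.9,
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'query_embedding') + 1.0",
        "params": { "query_vector": [0.12, -0.03] }
      }
    }
  }
}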
The problem is that I need to update the cached results about once an hour. I don't want my service to lose the ability to use the cache efficiently during the update procedure, so I'm not sure which of these solutions is best:
Sequentially update documents one by one, so the index isn't destroyed. The drawback of this solution, I'm afraid, is that every update causes index rebuilding, so cache requests will become slow.
Create an entirely new index with the new results and then somehow swap the current cache index with the new one. The drawbacks I see are:
a) I've found no elegant way to swap indexes
b) Users will get their cached results later than in solution (1)
I would go with #2, since every time you update a document the cache is flushed.
There is an elegant way to swap indices:
You have an alias that points to your current index; you fill a new index with the fresh records, and then you point the alias to the new index.
Something like this:
Current index name is items-2022-11-26-001
Create alias items pointing to items-2022-11-26-001
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "items-2022-11-26-001",
        "alias": "items"
      }
    }
  ]
}
Create new index with fresh data items-2022-11-26-002
When it finishes, now point the items alias to items-2022-11-26-002
POST _aliases
{
  "actions": [
    {
      "remove": {
        "index": "items-2022-11-26-001",
        "alias": "items"
      }
    },
    {
      "add": {
        "index": "items-2022-11-26-002",
        "alias": "items"
      }
    }
  ]
}
Delete items-2022-11-26-001
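That last step is a single call to the standard delete-index API:
DELETE items-2022-11-26-001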
Run all your queries against the items alias, which will act as an index.
References:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html

Fielddata is disabled on text fields by default

I've encountered a classic problem; however, no page on SO or any other Q&A site or forum has helped me.
I need to extract the numerical value of the parameter wsProcessingElapsedTimeMS out of a string like this (where the parameter is contained in the message field):
2018-07-31 07:37:43,740|DEBUG|[ACTIVE] ExecuteThread: '43' for queue:
'weblogic.kernel.Default (self-tuning)'
|LoggerHandler|logMessage|sessionId=9AWTu
wsOperationName=FindBen wsProcessingEndTime=2018-07-31 07:37:43.738
wsProcessingElapsedTimeMS=6 httpStatus=200 outgoingAddress=172.xxx.xxx.xxx
and I keep getting this error:
"type":"illegal_argument_exception","reason":"Fielddata is disabled on text
fields by default. Set fielddata=true on [message] in order to load fielddata
in memory by uninverting the inverted index. Note that this can however use
significant memory. Alternatively use a keyword field instead."
The point is, I already ran the query (via Dev Tools in the Kibana GUI, if that matters) to enable fielddata on the field in the following way:
PUT my_index/_mapping/message
{
  "message": {
    "properties": {
      "publisher": {
        "type": "text",
        "fielddata": true
      }
    }
  }
}
which returned a brief confirmation:
{
  "acknowledged": true
}
After that, I tried to rebuild the index like this:
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "my_index"
  },
  "dest": {
    "index": "my_index"
  }
}
(The ?wait_for_completion=false flag is set because otherwise it timed out; there's a lot of data in the system now.)
Finally, having performed the above steps, I also tried restarting the Kibana and Elasticsearch services (processes) to force a reindexing (which took really long).
Also, using message.keyword instead of message (as suggested in the official documentation) doesn't help; it's just empty in most cases.
I'm using the Kibana to access the ElasticSearch engine.
ElasticSearch v. 5.6.3
Kibana v. 5.6.3
Logstash v. 5.5.0
Any suggestion will be appreciated, even regarding the use of additional plugins (provided they have a release compatible with the above Kibana/Elasticsearch/Logstash versions, as I can't update them to newer ones right now).
Use your actual index and type names (not the field name) in the mapping update:
PUT your_index/_mapping/your_type
{
  "your_type": {
    "properties": {
      "publisher": {
        "type": "text",
        "fielddata": true
      }
    }
  }
}
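Once fielddata is enabled and the data has been reindexed, a terms aggregation on the field should work. A minimal sketch with the same placeholder names:
POST your_index/_search
{
  "size": 0,
  "aggs": {
    "top_publishers": {
      "terms": {
        "field": "publisher",
        "size": 10
      }
    }
  }
}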

Reindex fails due to SearchContextMissingException

My company is using Elasticsearch 2.3.4.
We have a cluster that contains 38 ES nodes, and we've been having a problem with reindexing some of our data lately.
We've reindexed very large indices before and had no problems, but recently, when trying to reindex much smaller indices (less than 10 GB), we get: "SearchContextMissingException [No search context found for id [XXX]]".
We have no idea what's causing this problem or how to fix it. We'd like some guidance.
Has anyone seen this exception before?
From GitHub comments on issues related to this, I think it can be avoided by changing the batch size.
From the documentation:
By default _reindex uses scroll batches of 1000. You can change the batch size with the size field in the source element:
POST _reindex
{
  "source": {
    "index": "source",
    "size": 100
  },
  "dest": {
    "index": "dest",
    "routing": "=cat"
  }
}
I had the same problem with an index that holds many huge documents. I had to reduce the batch size down to 10 (100 and 50 both didn't work).
This is the request that worked in the end:
POST _reindex?slices=5&refresh
{
  "source": {
    "index": "source_index",
    "size": 10
  },
  "dest": {
    "index": "dest_index"
  }
}
You should also set slices to the number of shards in your index.
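To find that shard count, the cat API works (source_index is the placeholder name from the request above; the pri column is the number of primary shards):
GET _cat/indices/source_index?v&h=index,pri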

FIELDDATA Data is too large

I opened Kibana and ran a search, and I got an error saying shards failed. I looked in the elasticsearch.log file and saw this error:
org.elasticsearch.common.breaker.CircuitBreakingException: [FIELDDATA] Data too large, data for [#timestamp] would be larger than limit of [622775500/593.9mb]
Is there any way to increase that limit of 593.9mb?
You can try to increase the fielddata circuit breaker limit to 75% (the default is 60%) in your elasticsearch.yml config file and restart your cluster:
indices.breaker.fielddata.limit: 75%
Or, if you prefer not to restart your cluster, you can change the setting dynamically:
curl -XPUT localhost:9200/_cluster/settings -d '{
  "persistent": {
    "indices.breaker.fielddata.limit": "75%"
  }
}'
Give it a try.
I ran into this problem too.
Then I checked the fielddata memory usage with this request:
GET /_stats/fielddata?fields=*
The output shows:
"logstash-2016.04.02": {
"primaries": {
"fielddata": {
"memory_size_in_bytes": 53009116,
"evictions": 0,
"fields": {
}
}
},
"total": {
"fielddata": {
"memory_size_in_bytes": 53009116,
"evictions": 0,
"fields": {
}
}
}
},
"logstash-2016.04.29": {
"primaries": {
"fielddata": {
"memory_size_in_bytes":0,
"evictions": 0,
"fields": {
}
}
},
"total": {
"fielddata": {
"memory_size_in_bytes":0,
"evictions": 0,
"fields": {
}
}
}
},
You can see my indices are named by date, and evictions are all 0. Also, the 2016.04.02 index holds 53009116 bytes of fielddata memory, while 2016.04.29 holds 0.
So I can conclude that the old data occupies all the memory and the new data can't use it, and when I then run an aggregation query on the new data, it raises the CircuitBreakingException.
You can set this in config/elasticsearch.yml:
indices.fielddata.cache.size: 20%
It makes ES evict fielddata when the memory limit is reached.
But the real solution may be to add more memory in the future, and monitoring fielddata memory use is a good habit.
More detail: https://www.elastic.co/guide/en/elasticsearch/guide/current/_limiting_memory_usage.html
An alternative solution for the CircuitBreakingException: [FIELDDATA] Data too large error is to clean up the old/unused fielddata cache.
I found out that fielddata.limit is shared across indices, so deleting the cache of an unused index/field can solve the problem.
curl -X POST "localhost:9200/MY_INDICE/_cache/clear?fields=foo,bar"
For more info https://www.elastic.co/guide/en/elasticsearch/reference/7.x/indices-clearcache.html
I think it is important to understand why this is happening in the first place.
In my case, I had this error because I was running aggregations on analyzed fields. If you really need your string field to be analyzed, you should consider using multi-fields: analyzed for searches and not_analyzed for aggregations, as sketched below.
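A minimal sketch of such a multi-field mapping in pre-5.x syntax (my_index, my_type, and publisher are placeholder names):
PUT my_index/_mapping/my_type
{
  "properties": {
    "publisher": {
      "type": "string",
      "index": "analyzed",
      "fields": {
        "raw": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
You would then search on publisher and aggregate on publisher.raw.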
I ran into this issue the other day. In addition to checking the fielddata memory, I'd also consider checking the JVM and OS memory. In my case, the admin had forgotten to modify ES_HEAP_SIZE and left it at 1 GB.
Just use:
ES_JAVA_OPTS="-Xms10g -Xmx10g" ./bin/elasticsearch
Since the default heap is 1 GB, if your data is big you should set it bigger.
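To check what heap your nodes actually got, the cat API is a quick sanity check:
GET _cat/nodes?v&h=name,heap.current,heap.max,heap.percent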

Elasticsearch query with nested aggregations causing out of memory

I have Elasticsearch installed with 16 GB of memory. I started using aggregations, but ran into a "java.lang.OutOfMemoryError: Java heap space" error when I attempted to issue the following query:
POST /test-index-syslog3/type-syslog/_search
{
  "query": {
    "query_string": {
      "default_field": "DstCountry",
      "query": "CN"
    }
  },
  "aggs": {
    "whatever": {
      "terms": {
        "field": "SrcIP"
      },
      "aggs": {
        "destination_ip": {
          "terms": {
            "field": "DstIP"
          },
          "aggs": {
            "port": {
              "terms": {
                "field": "DstPort"
              }
            }
          }
        }
      }
    }
  }
}
The query_string itself only returns 1266 hits, so I'm a bit confused by the OOM error.
Am I using aggregations incorrectly? If not, what can I do to troubleshoot this issue?
Thanks!
You are loading the entire SrcIP, DstIP, and DstPort fields into memory in order to aggregate on them. This is because Elasticsearch un-inverts the entire field to be able to rapidly look up a document's value for the field given its ID.
If you're largely going to be aggregating on a very small subset of the data, you should look into using doc values, as sketched below. A document's value is then stored on disk in a way that makes it easy to look up given the document's ID. There's a bit more overhead to it, but that way you leave it to the operating system's file-system cache to keep the relevant pages in memory, instead of having to load the entire field into the heap.
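A minimal sketch of enabling doc values in 1.x/2.x mapping syntax, assuming the three fields are not_analyzed strings (note that the mapping of an existing field can't be changed in place, so this requires reindexing):
PUT test-index-syslog3/_mapping/type-syslog
{
  "properties": {
    "SrcIP":   { "type": "string", "index": "not_analyzed", "doc_values": true },
    "DstIP":   { "type": "string", "index": "not_analyzed", "doc_values": true },
    "DstPort": { "type": "string", "index": "not_analyzed", "doc_values": true }
  }
}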
I'm not sure about the mapping, of course, but looking at the value, the field DstCountry can be not_analyzed. Then you could replace the query with a filter around the aggregations, as sketched below. Maybe that helps.
Also check that the fields you use in your aggregations are not_analyzed.
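A minimal sketch of the filtered variant, assuming ES 2.x or later (on 1.x you would use a filtered query instead); size: 0 skips returning hits, since only the aggregations matter here:
POST /test-index-syslog3/type-syslog/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "term": { "DstCountry": "CN" }
      }
    }
  },
  "aggs": {
    "whatever": {
      "terms": { "field": "SrcIP" }
    }
  }
}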
