Solr Performance Issues (Caching/RAM usage)

We are using Solr 5.2 (on Windows Server 2012 / JDK 1.8) for document content indexing and querying. We found that querying slows down intermittently under load.
In our analysis we found the following two issues.
Solr is not effectively using caches
Whenever a new document is indexed, Solr opens a new searcher and the existing caches become invalid (they are tied to the old IndexSearcher). In our scenario new documents are indexed very frequently (at least 10 documents per minute), so the caches are effectively useless: a new searcher is opened all the time to make new documents available for searching. How can we improve cache usage?
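For reference, the two knobs usually involved here are how often a new searcher is opened (the soft-commit interval) and how much of the old caches is autowarmed into the new searcher. A rough sketch, assuming a core named mycore on the default port and using Solr's Config API; the same values can also be set directly in solrconfig.xml:

# Open new searchers less often: soft-commit at most once per minute,
# so new documents become searchable with up to a 60 s delay.
curl "http://localhost:8983/solr/mycore/config" \
  -H "Content-Type: application/json" \
  -d '{"set-property": {"updateHandler.autoSoftCommit.maxTime": 60000}}'

# Autowarm part of the filter and query-result caches into each new searcher,
# so the caches are not completely cold after every commit.
curl "http://localhost:8983/solr/mycore/config" \
  -H "Content-Type: application/json" \
  -d '{"set-property": {"query.filterCache.autowarmCount": 128, "query.queryResultCache.autowarmCount": 64}}'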
RAM is not utilized
We observed that Solr is using only 1-2 GB of heap even though we have assigned 50 GB. It seems Solr is not loading the index into RAM, which leads to high I/O. Is it possible to configure Solr to load indexes fully into memory? We couldn't find any documentation about this.
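One thing worth checking here: by default Solr memory-maps its index files (MMapDirectory), so the index is cached by the operating system's page cache rather than by the JVM heap, and low heap usage by itself is not unusual. A 50 GB heap can actually work against this, since memory reserved for the heap is not available to the OS for caching the index files. A quick way to compare heap usage with physical memory on a stock install (response field names are approximate and may differ slightly by version):

# JVM heap usage vs. free/total physical memory, as reported by Solr itself.
curl "http://localhost:8983/solr/admin/info/system?wt=json"
# Roughly: jvm.memory shows the heap Solr is actually using, while
# system.freePhysicalMemorySize / system.totalPhysicalMemorySize show how much
# RAM is left for the OS to cache the index files.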

Related

Elasticsearch indices eating too much space

I'm using Elasticsearch 7.5.2 on Ubuntu. Recently, I began using Elasticsearch to display relevant search results on every page load. This shot up the query volume, but I also found that it has created large index files. Note that I'm using 'app-search' to power my queries.
Here are the index files that are occupying too much space:
.app-search-analytics-logs-loco_togo_production-7.1.0-2020.01.26 => 52 GB
.app-search-analytics-logs-loco_togo_production-7.1.0-2020.01.27 => 53 GB
I tried deleting these using curl, but they reappear with less space used (~5 GB each).
I want to know if there is a way to control these indices. I'm not sure what purpose these indices serve, and whether there is a way to prevent them.
I tried deleting these using curl, but they reappear with less space used (~5 GB each).
Your delete action was clearly executed, but it seems the indices are still being written to. If documents keep arriving in Elasticsearch, the index gets re-created.
So for example:
The index from 2020.01.27 holds 53 GB before the deletion. After you delete it, the data is gone and so is the index itself. But as soon as new documents for that same day (2020.01.27) get indexed, the index is re-created and contains only the documents that arrived after the deletion, which is probably the ~5 GB you see.
If this is not what you want, you need to check whether some sources are still sending data.
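If it helps, one simple way to check whether those indices are still receiving data is to watch their document counts for a couple of minutes, for example:

# Document count and store size per analytics index; if docs.count keeps
# growing between runs, something is still writing to them.
curl "localhost:9200/_cat/indices/.app-search-analytics-logs-*?v&h=index,docs.count,store.size"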
Hope this helps.
EDIT:
Q: However, is there a way to manage these indices? I don't want them to eat up too much space.
Yes! Index Lifecycle Management (ILM) is what you are looking for. It automates the maintenance/management of indices. For example, you could define a rollover to a new index every 30 GB to keep individual indices small, or delete an index after X days. Take a look at all the available phases and actions.
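A minimal sketch of such a policy (the policy name and thresholds below are only placeholders, and the policy still has to be attached to the indices, e.g. via an index template):

# Roll over to a new index at roughly 30 GB, and delete indices 30 days after rollover.
curl -X PUT "localhost:9200/_ilm/policy/app-search-logs-policy" \
  -H "Content-Type: application/json" \
  -d '{
    "policy": {
      "phases": {
        "hot":    { "actions": { "rollover": { "max_size": "30gb" } } },
        "delete": { "min_age": "30d", "actions": { "delete": {} } }
      }
    }
  }'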

Search/Filter/Sort on 1 million constantly changing documents

My use case: I have at most 1 million documents, and the documents are updated constantly (once every 5 minutes). Each document has almost 40 columns, and I have sort/filter/search requirements on almost every column.
Since the documents change constantly, a document's value from 5 minutes earlier is no longer valid. I am thinking that an ideal database component for this would need to run in memory. For the other use cases in the application (where documents do not change constantly), I am using an Elasticsearch cluster. To stay consistent with search elsewhere in the application, I want to explore whether I can run a separate ES node/cluster purely in memory for the use case above. I could not find any examples or precedents for running Elasticsearch in production in a pure in-memory configuration.
If not ES, can I run Apache Solr in memory? I am open to any technology that allows me to run in a pure in-memory mode and provides functionality similar to ES (free-text search at a per-column level).
What would you recommend for this use case?

Elasticsearch reindex store sizes vary greatly

I am running Elasticsearch 6.2.4. I have a program that automatically creates an index for me, as well as the mappings necessary for my data. For this issue, I created an index called "landsat", but it actually needs to be named "landsat_8", so I chose to reindex. The original "landsat" index has 2 shards and 0 read replicas. The store size is ~13.4 GB with ~6.6 GB per shard, and the index holds just over 515k documents.
I created a new index called "landsat_8" with 5 shards and 1 read replica, and started a reindex with no special options. On a very small Elastic Cloud cluster (4 GB RAM), it finished in 8 minutes. It was interesting to see that the final store size was only 4.2 GB, yet it still held all 515k documents.
After it was finished, I realized that I had failed to create my mappings before reindexing, so I blew it away and started over. I was shocked to find that after an hour, the _cat/indices endpoint showed that only 7.5 GB of data and 154,800 documents had been reindexed. 4 hours later, the entire job seemed to have died at 13.1 GB, but it showed only 254,000 documents had been reindexed.
On this small 4 GB cluster, the reindex operation was maxing out the CPU. I increased the cluster to the biggest one Elastic Cloud offered (64 GB RAM), with 5 shards and 0 read replicas, and started the job again. This time, I set the refresh_interval on the new index to -1 and changed the size for the reindex operation to 2000. Long story short, this job finished somewhere between 1h10m and 1h19m. However, this time I ended up with a total store size of 25 GB, where each shard held ~5 GB.
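For reference, those two tweaks look roughly like this (a sketch rather than the exact commands used; the batch size goes in source.size on the reindex request, and wait_for_completion=false is optional and just returns a task id instead of blocking):

# Disable refreshes on the target index while bulk reindexing.
curl -X PUT "localhost:9200/landsat_8/_settings" \
  -H "Content-Type: application/json" \
  -d '{ "index": { "refresh_interval": "-1" } }'

# Reindex with a batch size of 2000 documents per scroll request.
curl -X POST "localhost:9200/_reindex?wait_for_completion=false" \
  -H "Content-Type: application/json" \
  -d '{ "source": { "index": "landsat", "size": 2000 }, "dest": { "index": "landsat_8" } }'

# Re-enable refreshes once the reindex has finished.
curl -X PUT "localhost:9200/landsat_8/_settings" \
  -H "Content-Type: application/json" \
  -d '{ "index": { "refresh_interval": "1s" } }'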
I'm very confused as to why the reindex operation causes such wildly different results in store size and reindex performance. Why, when I don't explicitly define any mappings and let ES automatically create mappings, is the store size so much smaller? And why, when I use the exact same mappings as the original index, is the store so much bigger?
Any advice would be greatly appreciated. Thank you!
UPDATE 1:
Here are the only differences in mappings:
The left image is "landsat" and the right image is "landsat_8". There is a root level "type" field and a nested "properties.type" field in the original "landsat" index. I forgot one of my goals was to remove the field "properties.type" from the data during the reindex. I seem to have been successful in doing so, but at the same time, accidentally renamed the root-level "type" field mapping to "provider", thus "landsat_8" has an unused "provider" mapping and an auto-created "type" mapping.
So there are some problems here, but I wouldn't think this would nearly double my store size...

How to stop auto reindexing in Elasticsearch if any update happens?

I have a big use case with Elasticsearch, which holds millions of records.
I will be updating the records frequently, say 1,000 records per hour.
I don't want Elasticsearch to reindex on every update.
I am planning to reindex on a weekly basis.
Any idea how to stop the automatic reindexing on update?
Any other, better suggestion is also welcome. Thanks in advance :)
Elasticsearch (ES) updates an existing doc in the following manner:
1. Delete the old doc.
2. Index a new doc with the changes applied to it.
According to the ES docs:
In Elasticsearch, this lightweight process of writing and opening a new segment is called a refresh. By default, every shard is refreshed automatically once every second. This is why we say that Elasticsearch has near real-time search: document changes are not visible to search immediately, but will become visible within 1 second.
Note that these changes will not be visible/searchable until ES commits/flushes them to the filesystem cache and to disk. This is controlled by the soft commit (the ES refresh interval, 1 second by default) and the hard commit (which actually writes the documents to disk so they cannot be lost permanently, and which is more costly than a soft commit).
You need to make sure you tune your ES refresh interval and do proper load testing, as setting it very low or very high each has its own pros and cons.
For example, setting it very low (say 1 second) while too many updates are happening causes a performance hit and might even crash your system. Setting it very high (say 1 hour) means you no longer have NRT (near-real-time) search, and during that window the in-memory buffer could accumulate millions of documents (depending on your app), which can cause an out-of-memory error; committing such a large buffer is also a very costly affair.
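As a concrete illustration (the index name is a placeholder), the refresh interval can be changed on a live index, relaxed during heavy update periods and restored afterwards:

# Relax the refresh interval to 30 seconds while heavy updating is going on.
curl -X PUT "localhost:9200/my-index/_settings" \
  -H "Content-Type: application/json" \
  -d '{ "index": { "refresh_interval": "30s" } }'

# Restore the default near-real-time behaviour (1 second) afterwards.
curl -X PUT "localhost:9200/my-index/_settings" \
  -H "Content-Type: application/json" \
  -d '{ "index": { "refresh_interval": "1s" } }'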

Memory consumed by persistent index

We are using ArangoDB 3.1.3 for our project, and we have created a collection with 1 GB of data.
When we uploaded the data without creating a persistent index on any of the document attributes, the index memory shown in the web console was 225.4 MB.
When we uploaded the data after creating a persistent index on one attribute that is present in all documents, the memory size was still the same. We assumed the persistent index would consume more memory, but it did not.
How should we measure memory usage in ArangoDB, especially index memory?
I believe you can get the index's size through arangosh, as in:
db._collection("collectionName").figures()
There's another SO question similar to this one, but I can't seem to find it now.
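The same figures are also exposed over ArangoDB's HTTP API, if that is more convenient than arangosh; the collection name below is a placeholder, and in 3.1 the response should include an indexes entry with a count and a size in bytes:

# Collection figures, including index count and index memory, via the REST API.
# Add -u user:password if authentication is enabled.
curl "http://localhost:8529/_db/_system/_api/collection/collectionName/figures"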
