Indexing 7TB of data with Elasticsearch: FSCrawler stops after some time

I am using FSCrawler to build an index over more than 7TB of data. The indexing starts fine but then stops once the index size reaches 2.6GB. I believe this is a memory issue; how do I configure the memory?
My machine has 40GB of RAM and I have assigned 12GB to Elasticsearch.

You might also need to assign enough memory to FSCrawler itself, using FS_JAVA_OPTS. For example:
FS_JAVA_OPTS="-Xmx4g -Xms4g" bin/fscrawler
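If you ever need to adjust the Elasticsearch heap in the same way, recent versions honour ES_JAVA_OPTS (or the -Xms/-Xmx settings in config/jvm.options); a minimal sketch, assuming a tarball install:
# one-off 12 GB heap for Elasticsearch; the same values can be set permanently in config/jvm.options
ES_JAVA_OPTS="-Xms12g -Xmx12g" ./bin/elasticsearch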

Related

Elasticsearch reindex gets stuck

Context
We have two Elasticsearch clusters with 6 and 3 nodes respectively. The 6-node cluster is the one we use in the production environment and we use the 3-node one for testing purposes. (We have the same problem in both clusters.) All the nodes have the following characteristics:
Elasticsearch 7.4.2
1TB HDD disk
8 GB RAM
In our case, we need to reindex some of the indexes. Those indexes have billions of documents and a size between 50GB and 250GB.
Problem
Whenever we start a reindex, internally or from a remote source, the task starts working correctly but at some point it stops reindexing, for no apparent reason. We can't see anything in the logs. The task is not cancelled or anything; it simply stops reindexing documents, as if the task were stuck. We tried changing GC strategies, using CMS and Shenandoah, but nothing changed.
Has anyone run into the same problem?
It's difficult to find the root cause of these issues without debugging them, and with the little information you provided (cluster and index configuration, index slow logs, Elasticsearch error logs, and Elasticsearch hot threads are all missing, to name a few).
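A first thing worth checking is what the cluster itself reports about the stuck task. The Tasks API and the hot threads API cover exactly this; a minimal sketch, assuming the nodes are reachable on localhost:9200:
# list running reindex tasks with per-task progress (created vs. total documents)
curl -s "localhost:9200/_tasks?detailed=true&actions=*reindex&pretty"
# see what the busiest threads on each node are doing while the task appears stuck
curl -s "localhost:9200/_nodes/hot_threads"
If the task shows no progress between two calls and the hot threads show nothing related, the slow logs and error logs mentioned above are the next place to look.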

Elasticsearch reindex store sizes vary greatly

I am running Elasticsearch 6.2.4. I have a program that will automatically create an index for me as well as the mappings necessary for my data. For this issue, I created an index called "landsat" but it needs to actually be named "landsat_8", so I chose to reindex. The original "landsat" index has 2 shards and 0 read replicas. The store size is ~13.4gb with ~6.6gb per shard and the index holds just over 515k documents.
I created a new index called "landsat_8" with 5 shards, 1 read replica, and started a reindex with no special options. On a very small Elastic Cloud cluster (4GB RAM), it finished in 8 minutes. It was interesting to see that the final store size was only 4.2gb, yet it still held all 515k documents.
After it was finished, I realized that I had failed to create my mappings before reindexing, so I blew it away and started over. I was shocked to find that after an hour, the _cat/indices endpoint showed that only 7.5gb of data and 154,800 documents had been reindexed. 4 hours later, the entire job seemed to have died at 13.1gb, but it only showed 254,000 documents had been reindexed.
On this small 4gb cluster, this reindex operation was maxing out CPU. I increased the cluster to the biggest one Elastic Cloud offered (64gb ram), 5 shards, 0 RR and started the job again. This time, I set the refresh_interval on the new index to -1 and changed the size for the reindex operation to 2000. Long story short, this job ended in somewhere between 1h10m and 1h19m. However, this time I ended up with a total store size of 25gb, where each shard held ~5gb.
I'm very confused as to why the reindex operation causes such wildly different results in store size and reindex performance. Why, when I don't explicitly define any mappings and let ES automatically create mappings, is the store size so much smaller? And why, when I use the exact same mappings as the original index, is the store so much bigger?
Any advice would be greatly appreciated. Thank you!
UPDATE 1:
Here are the only differences in mappings:
The left image is "landsat" and the right image is "landsat_8". There is a root level "type" field and a nested "properties.type" field in the original "landsat" index. I forgot one of my goals was to remove the field "properties.type" from the data during the reindex. I seem to have been successful in doing so, but at the same time, accidentally renamed the root-level "type" field mapping to "provider", thus "landsat_8" has an unused "provider" mapping and an auto-created "type" mapping.
So there are some problems here, but I wouldn't think this would nearly double my store size...
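For reference, the refresh_interval and batch-size changes described in the question would look roughly like the sketch below (index names taken from the question; not the exact requests that were run):
# disable refresh on the destination index while bulk-loading it
curl -s -XPUT "localhost:9200/landsat_8/_settings" -H 'Content-Type: application/json' -d '{"index":{"refresh_interval":"-1"}}'
# reindex with a scroll batch size of 2000 documents per bulk request
curl -s -XPOST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d '{"source":{"index":"landsat","size":2000},"dest":{"index":"landsat_8"}}'
# restore the default refresh behaviour once the reindex is done
curl -s -XPUT "localhost:9200/landsat_8/_settings" -H 'Content-Type: application/json' -d '{"index":{"refresh_interval":null}}'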

Memory consumed by persistent index

We are using ArangoDB 3.1.3 for our project and we have created a collection with 1GB of data.
When we uploaded the data without creating a persistent index for the attributes in the documents, the index memory shown in the web console was 225.4 MB.
When we uploaded the data with a persistent index on one of the attributes that is present in all documents, the memory size was still the same. We assumed that the persistent index would consume more memory, but it did not.
How should we measure memory usage in ArangoDB, especially index memory?
I believe you can get the index's size through arangosh, as in:
db.[collectionName].figures()
There's another SO question similar to this, but I can't seem to find it now.
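The same figures are also exposed over the HTTP API, which can be easier to script than arangosh; a sketch, assuming a local server and a hypothetical collection named myCollection:
# collection figures, which include an estimate of the memory used by its indexes
curl -s "http://localhost:8529/_api/collection/myCollection/figures"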

Docker Elasticsearch Bulk index timeout

I am running Elasticsearch 2.3 using the official Docker builds. I am trying to bulk index a fairly large dataset. The dataset in question is about 700MB and takes around 30 minutes on a non-dockerized setup. Around 24 hours ago I started the bulk index operation on the Docker Elasticsearch container. It still hasn't completed; worse, there is no load on the server, which indicates it's not even attempting to index.
I know the bulk indexing works because I can index a smaller dataset and it works without a problem.
Are there any specific settings that I need to be aware of when indexing data over a certain size? Or any way to check why it errored?
Thanks in advance.
For any future people reading this, firstly: hello from the past!
Secondly, Elasticsearch has a default maximum HTTP request size of 100MB (http.max_content_length), so make sure your bulk requests (including posted files) stay below that.
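If a single bulk file is larger than that, the usual approach is to split it into chunks below the limit rather than raising http.max_content_length; a sketch in shell, assuming newline-delimited bulk data in bulk.json where every action line is followed by a source line:
# split into chunks with an even line count so action/source pairs stay together
split -l 50000 bulk.json bulk_part_
# send each chunk as its own _bulk request
for f in bulk_part_*; do
  curl -s -XPOST "localhost:9200/_bulk" --data-binary "@$f"
done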

Solr Performance Issues (Caching/RAM usage)

We are using Solr 5.2 (on Windows Server 2012 / JDK 1.8) for document content indexing/querying. We found that querying slows down intermittently under load.
In our analysis we found the following two issues.
Solr is not effectively using caches
Whenever a new document is indexed, a new searcher is opened and the cache becomes invalid (as it was associated with the old IndexSearcher). In our scenario, new documents are indexed very frequently (at least 10 documents per minute). So effectively the cache is not useful, as a new searcher is opened frequently to make new documents available for searching. How can we improve cache usage?
RAM is not utilized
We observed that Solr is using only 1-2 GB of heap even though we have assigned 50 GB. It seems it is not loading the index into RAM, which leads to high I/O. Is it possible to configure Solr to fully load indexes into memory? We can't find any documentation about this.
