Docker Elasticsearch Bulk index timeout - elasticsearch

I am running Elasticsearch 2.3 using the official Docker builds. I am trying to bulk index a fairly large dataset. The dataset in question is about 700 MB and on a non-Dockerized setup takes around 30 minutes to index. Around 24 hours ago I started the bulk index operation on the Docker Elasticsearch container. It still hasn't completed; worse, there is no load on the server, which suggests it's not even attempting to index.
I know the bulk indexing works because I can index a smaller dataset and it works without a problem.
Are there any specific settings that I need to be aware of when indexing data over a certain size? Or is there any way to check why it errored?
Thanks in advance.

For any future people reading this, firstly: hello from the past!
Secondly, Elasticsearch has a default maximum HTTP request size of 100 MB (the http.max_content_length setting), so make sure your bulk requests (including posted files) are below that.
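One way to stay under that limit is to split the bulk file into several smaller request bodies before posting them. The following is only a minimal sketch, not anything from the Elasticsearch client: the function name chunk_bulk_lines is made up for illustration, and it assumes every action line is followed by exactly one source line (true for index and create actions, not for delete).

```python
from typing import Iterable, Iterator, List


def chunk_bulk_lines(lines: Iterable[str], max_bytes: int) -> Iterator[str]:
    """Yield bulk request bodies, each at most max_bytes when UTF-8 encoded.

    Assumption: the input alternates action line / source line, as in a
    bulk file that only contains index or create actions.
    """
    batch: List[str] = []
    size = 0
    it = iter(lines)
    for action in it:
        source = next(it, None)
        if source is None:
            raise ValueError("action line without a matching source line")
        # The bulk format is newline-delimited; every line must end with \n.
        pair = action.rstrip("\n") + "\n" + source.rstrip("\n") + "\n"
        pair_size = len(pair.encode("utf-8"))
        if pair_size > max_bytes:
            raise ValueError("a single document exceeds max_bytes")
        # Start a new body when adding this pair would cross the limit.
        if batch and size + pair_size > max_bytes:
            yield "".join(batch)
            batch, size = [], 0
        batch.append(pair)
        size += pair_size
    if batch:
        yield "".join(batch)
```

Each yielded string can then be sent as its own bulk request. Choosing max_bytes well below 100 MB leaves headroom; the exact value is a judgment call.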

Related

Elasticsearch reindex gets stuck

Context
We have two Elasticsearch clusters with 6 and 3 nodes respectively. The 6-node cluster is the one we use in production, and we use the 3-node cluster for testing. (We have the same problem in both clusters.) All the nodes have the following characteristics:
Elasticsearch 7.4.2
1TB HDD disk
8 GB RAM
In our case, we need to reindex some of the indexes. Those indexes have billions of documents and a size between 50GB and 250GB.
Problem
Whenever we start reindexing, internally or from a remote source, the task starts working correctly but reaches a point where it stops reindexing for no apparent reason. We can't see anything in the logs. The task is not cancelled; it simply stops reindexing documents, as if it were stuck. We tried changing GC strategies (we used CMS and Shenandoah), but nothing changed.
Has anyone run into the same problem?
It's difficult to find the root cause of these issues without debugging, especially with the little information you provided (the cluster and index configuration, index slow logs, Elasticsearch error logs, and Elasticsearch hot threads are all missing, to name a few).

Limit disk usage on Elasticsearch

Sorry if this is a simple question - I'm new to ELK and have it all running with data coming through ok. My issue is that I'm concerned about storage growth given the number of records that will be coming through.
Searching on Google, I've seen that Graylog has a setting to limit the amount of data to retain ( Graylog2- how to config logs retention to 1 week ) and I'd like to do the same in ELK, but I can't find the correct setting.
There is no easy way to do this in the GUI (yet). What you need is Curator, which can delete or roll up indices based on time (for example, delete indices older than 7 days) or on the number of documents in an index.
In a future version there will be a built-in tool for that in Kibana, but it's not in the current release (6.5). It will probably be released with Elastic 6.6 (as a beta), but you may even have to wait for 7.x.
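To make the "delete indices older than 7 days" rule concrete, here is a small sketch of the selection step only. It is not Curator's code or configuration; the function name indices_older_than is hypothetical, and it assumes daily indices named like logstash-YYYY.MM.dd. Actually deleting the returned indices is left out.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, List


def indices_older_than(
    names: Iterable[str],
    today: date,
    max_age_days: int = 7,
    prefix: str = "logstash-",
) -> List[str]:
    """Return the index names whose date suffix is more than max_age_days old.

    Assumption: indices are created daily and named <prefix>YYYY.MM.dd.
    Names that do not match that pattern are ignored, never returned.
    """
    cutoff = today - timedelta(days=max_age_days)
    expired = []
    for name in names:
        if not name.startswith(prefix):
            continue
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix is not a date, so leave this index alone
        if day < cutoff:
            expired.append(name)
    return expired
```

An index that is exactly max_age_days old is kept; only strictly older ones are selected.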

Search/Filter/Sort on constantly changing 1 million documents

My use case: I have at most 1 million documents, and they are updated constantly (once every 5 minutes). Each document has almost 40 columns, and I have sort/filter/search requirements on almost every column.
Since the documents are changing constantly, the doc value 5 minutes earlier is not valid anymore. I am thinking that an ideal DB component will need to be running in memory. For the other use-cases in the application (where documents do not change constantly), I am using ElasticSearch cluster. So to be consistent with the search elsewhere in the application, I want to explore if I can run a separate ES node/cluster purely in memory for my use-case above. I could not find any examples or precursors for running ElasticSearch in production in a pure in-memory configuration.
If not ES, can I run Apache Solr in memory? I can try out any technology which allows me to run in a pure in-memory mode, and provide functionality similar to ES (free text search at a per-column level).
What would you recommend for this use-case?

Re-processing data for Elasticsearch with a new pipeline

I have an ELK-stack server that is being used to analyse Apache web log data. We're loading ALL of the logs, going back several years. The purpose is to look at some application-specific trends over this time period.
The data-processing pipeline is still being tweaked, as this is the first time anyone has looked in detail into this data and some people are still trying to decide how they want the data to be processed.
Some changes were suggested and while they're easy enough to do in the logstash pipeline for new, incoming data, I'm not sure how to apply these changes to the data that's already in elastic. It took several days to load the current data set, and quite a bit more data has been added so re-processing everything through logstash, with the modified pipeline will probably take several days longer.
What's the best way to apply these changes to data that has already been ingested into elastic? In the early stages of testing this set-up, I would just remove the index and rebuild from scratch, but that was done with very limited data sets and with the amount of data in use here, I'm not sure that's feasible. Is there a better way?
Set up an ingest pipeline and use the reindex API to move data from the current index to a new index (with the pipeline configured for the destination index).
Ingest Node
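As a rough illustration of that suggestion, the request body for the reindex API names the source index, the destination index, and the ingest pipeline to run on the destination. The sketch below only builds that body; the index names and the pipeline name are hypothetical placeholders, and the pipeline itself is assumed to have been created already.

```python
import json
from typing import Any, Dict


def build_reindex_body(source_index: str, dest_index: str, pipeline: str) -> Dict[str, Any]:
    """Build the JSON body for a reindex request that runs every document
    through an ingest pipeline on its way into the destination index."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index, "pipeline": pipeline},
    }


# Hypothetical names, for illustration only.
body = build_reindex_body("apache-logs-v1", "apache-logs-v2", "apache-cleanup")
print(json.dumps(body, indent=2))
```

The printed JSON is what would be sent in a POST to the _reindex endpoint. Since the data is read from the existing index rather than parsed again by Logstash, this avoids re-reading the original log files.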

Elasticsearch indexes but does not store documents

I'm having trouble storing documents in a 3-node Elasticsearch cluster that was previously able to store documents. I use the Java API to send bulks of documents to Elasticsearch, which are accepted (no failure in the BulkResponse object), AND Elasticsearch shows heavy indexing activity. However, the number of documents does not increase, so I assume that none of them are stored.
I've looked into Elasticsearch logs (of all three nodes) but I see no errors or warnings.
Note: I've had to restart two nodes previously but search/query is working perfectly. (the count in the image starts at ~17:00 as I've installed the Marvel plugin at this time)
What can I do to solve or debug the problem?
Sorry for this point-blank code blindness on my part! I forgot to advance (skip) the cursor when reading from MongoDB and therefore re-inserted the same 1000 documents into Elasticsearch thousands of times!
Learning: If this problem occurs, check that you select the correct documents from your database and that these documents are not already stored in ES.
Sidenote on Marvel: It would be great if this could be indicated in some way, e.g. by having a chart with "updated documents" (I rechecked and could not find one).
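The bug described above comes down to not moving the read offset forward between batches. A minimal sketch of the corrected paging loop follows. The names read_in_batches and fetch_page are hypothetical: fetch_page stands in for whatever skip/limit query the application runs against MongoDB, and is not a real driver call.

```python
from typing import Callable, Iterator, List, Sequence


def read_in_batches(
    fetch_page: Callable[[int, int], Sequence],
    batch_size: int = 1000,
) -> Iterator[List]:
    """Yield successive batches, advancing the offset after each one.

    fetch_page(skip, limit) is a placeholder for the real database query.
    Forgetting the `offset += ...` line reproduces the bug: the same
    first batch is returned forever.
    """
    offset = 0
    while True:
        page = list(fetch_page(offset, batch_size))
        if not page:
            break
        yield page
        offset += len(page)  # the step that was missing
```

Because re-sending a document with the same ID overwrites it instead of adding a new one, the bulk responses report success while the document count stays flat, which matches the symptoms in the question.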
