Error working with "ScrollElasticsearchHttp" processor in NiFi - elasticsearch

I am trying to retrieve data from an index in Elasticsearch. I configured the "QueryElasticsearchHttp" processor and it works just fine. However, when I try to use the ScrollElasticsearchHttp processor with the same URL, query, and index properties, and with 'scroll' set to the default of 1 minute, it doesn't work.
I get a 404 error response: "Elasticsearch returned code 404 with message Not found".
I am also tailing the log on the ES cluster, and I see this error:
[DEBUG][o.e.a.s.TransportSearchScrollAction] [2] Failed to execute query phase
org.elasticsearch.transport.RemoteTransportException:[127.0.0.1:9300][indices:data/read/search[phase/query+fetch/scroll]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [2]
at org.elasticsearch.search.SearchService.getExecutor(SearchService.java:457) ~[elasticsearch-7.5.2.jar:7.5.2]
I am on Apache NiFi 1.10.0
Here is the config for the processor:
I should see a total of 441 hits, and with page size 20 I should see 23 queries being made to ES.
But I don't get a single result back. I have tried higher values for "scroll" and also played around with "page size" to no avail.
I also noticed that even though the ScrollElasticsearchHttp processor is set to run every 1 minute, I don't see the error repeated every minute in the ES log.
Update:
When I cleared the state via the UI ("View state" -> "Clear state"), I was able to make a single call, which returned a page full of hits in one flowfile.
However, there are more pages to be retrieved. How do I make the processor fetch the next page?
My understanding was that a single invocation of ScrollElasticsearchHttp would page through all the result sets and bring in each page as one flowfile. Is this not correct?

Decrease the scheduling interval to around 10-20 seconds. Then, every 10-20 seconds, the processor will fetch the next set of records based on your page size.
You can check the state value while fetching is in progress, i.e. you will find a scroll id in it. Once fetching is complete, the state value changes to "finishedQuery" : true.
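If you prefer to check this outside the UI, NiFi also exposes component state over its REST API. Below is a minimal Scala sketch (JDK 11 HttpClient); the NiFi host, port, and processor id are placeholders, and secured instances will additionally need authentication:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest}
import java.net.http.HttpResponse.BodyHandlers

object NifiProcessorState {
  def main(args: Array[String]): Unit = {
    val nifi        = "http://localhost:8080"                // placeholder NiFi host/port
    val processorId = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" // placeholder processor id

    // GET /nifi-api/processors/{id}/state returns the processor's component state;
    // while a scroll is in progress it contains the scroll id, and
    // "finishedQuery" : true once all pages have been fetched.
    val req = HttpRequest.newBuilder(
        URI.create(s"$nifi/nifi-api/processors/$processorId/state"))
      .GET()
      .build()

    println(HttpClient.newHttpClient().send(req, BodyHandlers.ofString()).body())
  }
}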

Related

Scroll contexts are left open and they never get deleted or expired in Elasticsearch v7.3?

I am using ES v7.3 with slicing to stream data from ES. What I observe is that after we stream the data a few times, some scroll contexts are left open; they remain open for days and never expire or get killed, so the searches keep running and we observe high CPU spikes. We also see the following message in the logs:
[2020-02-07T06:49:33,559][DEBUG][o.e.a.s.TransportSearchScrollAction] [ip-1-0-104-220] [1234717] Failed to execute query phase
org.elasticsearch.transport.RemoteTransportException: [ip-1-0-104-220][1.0.104.220:9300][indices:data/read/search[phase/query/scroll]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [1234717]
at org.elasticsearch.search.SearchService.getExecutor(SearchService.java:462) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.search.SearchService.runAsync(SearchService.java:344) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:401) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.search.SearchTransportService.lambda$registerRequestHandler$10(SearchTransportService.java:367) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257) [x-pack-security-7.3.1.jar:7.3.1]
Could anyone advise whether we are missing an index-level setting that we should enforce so that these open contexts get killed or expire after a timeout is reached?
To check and delete the open contexts, I am using the following commands, respectively:
GET _nodes/stats/indices?filter_path=**.open_contexts
DELETE /_search/scroll/_all
Moreover, my timeouts are:
exports.ELASTICSEARCH = {
  PARALLEL_SLICES     : 2,
  SCROLL_ALIVE_TIME   : '5m',
  SLICE_ALIVE_TIME    : '1m',
  SCROLL_SIZE         : 10000,
  REQUEST_RETRY_COUNT : 5,
  REQUEST_TIMEOUT     : 120000, // in milliseconds
  ERROR_RETRY_COUNT   : 3
};

Elasticsearch giving cached result even after 5-6 seconds

My system calls Elasticsearch. After updating a document, I would like to fetch the same document again. While doing so, Elasticsearch sometimes returns cached results (results from before the update), even when I retry the get after 5-6 seconds.
I have used refresh: 'wait_for' while updating the document. Can anyone suggest a workaround? I would like to fetch the latest revision of the updated document. My query to fetch is:
body: {
  query: {
    terms: {
      _id: [idsToFetch]
    }
  }
}
First, check what refresh interval is set for your index (it defaults to 1 second); in that case refresh=wait_for should return within at most 1 second. But as explained in the official ES documentation:
If the refresh interval is set to -1, disabling the automatic refreshes, then requests with refresh=wait_for will wait indefinitely until some action causes a refresh. Conversely, setting index.refresh_interval to something shorter than the default like 200ms will make refresh=wait_for come back faster, but it'll still generate inefficient segments.
You can get the refresh_interval set for the index using the get-settings API (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-get-settings.html); note that it appears in the result only if it is not set to its default value.
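For illustration, a minimal Scala sketch (JDK 11 HttpClient) that checks the effective refresh_interval; the host and index name are placeholders, and include_defaults=true makes the default value show up even when the setting was never set explicitly:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest}
import java.net.http.HttpResponse.BodyHandlers

object CheckRefreshInterval {
  def main(args: Array[String]): Unit = {
    val es    = "http://localhost:9200" // placeholder host/port
    val index = "my-index"              // placeholder index name

    // filter_path just trims the response down to the refresh_interval values:
    // the explicit setting (if any) appears under "settings", the default under "defaults".
    val url = s"$es/$index/_settings?include_defaults=true" +
      "&filter_path=*.settings.index.refresh_interval,*.defaults.index.refresh_interval"

    val req = HttpRequest.newBuilder(URI.create(url)).GET().build()
    println(HttpClient.newHttpClient().send(req, BodyHandlers.ofString()).body())
  }
}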
Let me know if you face any issues or have more questions.

Array index out of bounds exception while downloading Elasticsearch index

I am trying to download a complete Elasticsearch index using:
curl -o output_filename -m 600 -GET 'http://ip/index/_search?q=*&size=7000000'
But it's giving this error:
{"error":"ArrayIndexOutOfBoundsException[-131072]","status":500}
How can I download the complete index data?
The scroll API is what you're looking for; it supports proper pagination:
Scrolling is not intended for real time user requests, but rather for processing large amounts of data
It's the same /_search endpoint, but it additionally gets passed the ?scroll=<timeout> parameter.
Be sure to understand what the timeout in e.g. scroll=1m means: it keeps your scroll context alive until you request the next batch/page.
Use the scroll_id from the response to request the next batch/page.
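A minimal sketch of the whole round trip over HTTP, written in Scala with the JDK 11 HttpClient; the host, index name, page size, and query are placeholders:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest}
import java.net.http.HttpRequest.BodyPublishers
import java.net.http.HttpResponse.BodyHandlers

object ScrollAll {
  def main(args: Array[String]): Unit = {
    val es     = "http://localhost:9200" // placeholder host/port
    val index  = "my-index"              // placeholder index name
    val client = HttpClient.newHttpClient()

    def post(url: String, json: String): String = {
      val req = HttpRequest.newBuilder(URI.create(url))
        .header("Content-Type", "application/json")
        .POST(BodyPublishers.ofString(json))
        .build()
      client.send(req, BodyHandlers.ofString()).body()
    }

    // 1. Open the scroll: a normal search with ?scroll=<keep-alive> and a page size.
    var page = post(s"$es/$index/_search?scroll=1m",
      """{"size": 1000, "query": {"match_all": {}}}""")

    // 2. Keep requesting the next page with the _scroll_id from the previous response
    //    until a page comes back with no hits. (Crude string checks keep this sketch
    //    dependency-free; real code should parse the JSON properly.)
    val scrollIdRe = "\"_scroll_id\"\\s*:\\s*\"([^\"]+)\"".r
    var scrollId   = scrollIdRe.findFirstMatchIn(page).map(_.group(1))
    while (scrollId.isDefined && !page.contains("\"hits\":[]")) {
      println(page) // process the current page here
      page = post(s"$es/_search/scroll",
        s"""{"scroll": "1m", "scroll_id": "${scrollId.get}"}""")
      scrollId = scrollIdRe.findFirstMatchIn(page).map(_.group(1))
    }

    // 3. Free the server-side scroll context as soon as you are done with it.
    scrollId.foreach { id =>
      val del = HttpRequest.newBuilder(URI.create(s"$es/_search/scroll"))
        .header("Content-Type", "application/json")
        .method("DELETE", BodyPublishers.ofString(s"""{"scroll_id": "$id"}"""))
        .build()
      client.send(del, BodyHandlers.ofString())
    }
  }
}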

Spark Streaming and ElasticSearch - Could not write all entries

I'm currently writing a Scala application made of a Producer and a Consumer. The Producer gets data from an external source and writes it to Kafka. The Consumer reads from Kafka and writes to Elasticsearch.
The Consumer is based on Spark Streaming; every 5 seconds it fetches new messages from Kafka and writes them to Elasticsearch. The problem is that I'm not able to write to ES, because I get a lot of errors like the one below:
[ERROR] [2015-04-24 11:21:14,734] [org.apache.spark.TaskContextImpl]: Error in TaskCompletionListener
org.elasticsearch.hadoop.EsHadoopException: Could not write all entries [3/26560] (maybe ES was overloaded?). Bailing out...
  at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:225) ~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3]
  at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:236) ~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3]
  at org.elasticsearch.hadoop.rest.RestService$PartitionWriter.close(RestService.java:125) ~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3]
  at org.elasticsearch.spark.rdd.EsRDDWriter$$anonfun$write$1.apply$mcV$sp(EsRDDWriter.scala:33) ~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3]
  at org.apache.spark.TaskContextImpl$$anon$2.onTaskCompletion(TaskContextImpl.scala:57) ~[spark-core_2.10-1.2.1.jar:1.2.1]
  at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68) [spark-core_2.10-1.2.1.jar:1.2.1]
  at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66) [spark-core_2.10-1.2.1.jar:1.2.1]
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) [na:na]
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) [na:na]
  at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66) [spark-core_2.10-1.2.1.jar:1.2.1]
  at org.apache.spark.scheduler.Task.run(Task.scala:58) [spark-core_2.10-1.2.1.jar:1.2.1]
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) [spark-core_2.10-1.2.1.jar:1.2.1]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_65]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_65]
  at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
Consider that the producer is writing 6 messages every 15 seconds, so I really don't understand how this "overload" can possibly happen (I even cleaned the topic and flushed all old messages; I thought it was related to an offset issue). The task executed by Spark Streaming every 5 seconds can be summarized by the following code:
val result = KafkaUtils.createStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, Map("wasp.raw" -> 1), StorageLevel.MEMORY_ONLY_SER_2)
val convertedResult = result.map(k => (k._1, AvroToJsonUtil.avroToJson(k._2)))
// TO-DO: Remove resource (yahoo/yahoo) hardcoded parameter
log.info(s"*** EXECUTING SPARK STREAMING TASK + ${java.lang.System.currentTimeMillis()}***")
convertedResult.foreachRDD(rdd => {
  rdd.map(data => data._2).saveToEs("yahoo/yahoo", Map("es.input.json" -> "true"))
})
If I try to print the messages instead of sending to ES, everything is fine and I actually see only 6 messages. Why can't I write to ES?
For the sake of completeness, I'm using this library to write to ES: elasticsearch-spark_2.10, with the latest beta version.
I found, after many retries, a way to write to Elasticsearch without getting any error. Basically, passing the parameter "es.batch.size.entries" -> "1" to the saveToEs method solved the problem. I don't understand why using the default or any other batch size leads to the aforementioned error, considering that I would expect an error message if I were trying to write more than the allowed maximum batch size, not less.
Moreover, I've noticed that I actually was writing to ES, just not all of my messages: I was losing between 1 and 3 messages per batch.
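For reference, a minimal sketch of the working call, plugging the workaround into the snippet from the question ("yahoo/yahoo" and es.input.json are carried over from it):

import org.apache.spark.rdd.RDD
import org.elasticsearch.spark._ // adds saveToEs to RDDs (elasticsearch-spark)

// jsonDocs corresponds to rdd.map(data => data._2) from the question's foreachRDD.
def writeToEs(jsonDocs: RDD[String]): Unit =
  jsonDocs.saveToEs(
    "yahoo/yahoo",                        // index/type used in the question
    Map(
      "es.input.json"         -> "true",  // the documents are already JSON strings
      "es.batch.size.entries" -> "1"      // flush after every document (the workaround)
    )
  )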
When I pushed a DataFrame to ES from Spark, I had the same error message, even with the "es.batch.size.entries" -> "1" configuration.
Once I increased the bulk thread pool in ES, the issue went away.
For example:
Bulk pool
threadpool.bulk.type: fixed
threadpool.bulk.size: 600
threadpool.bulk.queue_size: 30000
As was already mentioned here, this is a document write conflict.
Your convertedResult data stream contains multiple records with the same id. When these are written to Elasticsearch as part of the same batch, they produce the error above.
Possible solutions:
Generate a unique id for each record. Depending on your use case, this can be done in a few different ways; for example, one common solution is to create a new field by combining the id and lastModifiedDate fields and use that field as the id when writing to Elasticsearch.
Perform de-duplication of records based on id: select only one record with a particular id and discard the other duplicates. Depending on your use case, this could be the most recent record (based on a timestamp field), the most complete (most of the fields contain data), etc.
The #1 solution will store all records that you receive in the stream.
The #2 solution will store only the unique records for a specific id, based on your de-duplication logic. The result would be the same as setting "es.batch.size.entries" -> "1", except you will not limit performance by writing one record at a time.
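A minimal sketch of option #2, de-duplicating by id before writing; the extractor helpers are placeholders you would replace with real JSON parsing for your documents:

import org.apache.spark.SparkContext._ // pair-RDD functions on older Spark versions
import org.apache.spark.rdd.RDD
import org.elasticsearch.spark._

// Placeholder helpers: parse the id and a timestamp (e.g. lastModifiedDate) out of each JSON document.
def extractId(json: String): String      = ???
def extractTimestamp(json: String): Long = ???

// Keep only the most recent document per id, then write the de-duplicated batch.
def writeDeduplicated(jsonDocs: RDD[String]): Unit =
  jsonDocs
    .keyBy(extractId)
    .reduceByKey((a, b) => if (extractTimestamp(a) >= extractTimestamp(b)) a else b)
    .values
    .saveToEs("yahoo/yahoo", Map("es.input.json" -> "true"))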
One possibility is the cluster/shard status being RED. Address this issue, which may be due to unassigned replicas. Once the status turned GREEN, the API call succeeded just fine.
This is a document write conflict.
For example:
Multiple documents specify the same _id for Elasticsearch to use.
These documents are located in different partitions.
Spark writes multiple partitions to ES simultaneously.
The result is Elasticsearch receiving multiple updates for a single document at once, from multiple sources / through multiple nodes / containing different data.
This explains "I was losing between 1 and 3 messages per batch": a fluctuating number of failures when the batch size is > 1, and success when the batch write size is "1".
Just adding another potential reason for this error, hopefully it helps someone.
If your Elasticsearch index has child documents, then:
If you are using a custom routing field (not _id), then according to the documentation the uniqueness of the documents is not guaranteed. This might cause issues while updating from Spark.
If you are using the standard _id, uniqueness will be preserved; however, you need to make sure the following options are provided while writing from Spark to Elasticsearch:
es.mapping.join
es.mapping.routing
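For illustration, these are passed like any other elasticsearch-hadoop setting; the index name and field names below are placeholders for your own mapping:

import org.apache.spark.rdd.RDD
import org.elasticsearch.spark._

// Placeholder field names: use the join field and routing field from your own mapping.
def writeParentChild(docs: RDD[Map[String, Any]]): Unit =
  docs.saveToEs(
    "my-index",                               // placeholder index name
    Map(
      "es.mapping.join"    -> "myJoinField",  // the join-type field in the mapping
      "es.mapping.routing" -> "routingField"  // document field used for routing
    )
  )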

Solr performance with commitWithin does not make sense

I am running a very simple performance experiment where I post 2000 documents to my application, which in turn persists them to a relational DB and sends them to Solr for indexing (synchronously, in the same request).
I am testing 3 use cases:
No indexing at all - ~45 sec to post 2000 documents
Indexing included - commit after each add. ~8 minutes (!) to post and index 2000 documents
Indexing included - commitWithin 1ms. ~55 seconds (!) to post and index 2000 documents
The 3rd result does not make any sense; I would expect the behavior to be similar to the one in point 2. At first I thought that the documents were not really committed, but I could actually see them being added by executing some queries during the experiment (via the Solr web UI).
I am worried that I am missing something very big. Is it possible that committing after each add will degrade performance by a factor of 400?!
The code I use for point 2:
SolrInputDocument doc = // get doc
SolrServer solrConnection = // get connection
solrConnection.add(doc);
solrConnection.commit();
Whereas the code for point 3:
SolrInputDocument doc = // get doc
SolrServer solrConnection = // get connection
solrConnection.add(doc, 1); // according to the API documentation, there is no need to call an explicit commit after this
According to this wiki:
https://wiki.apache.org/solr/NearRealtimeSearch
commitWithin is a soft commit by default. Soft commits are very efficient in terms of making the added documents immediately searchable. But! They are not on disk yet; the documents are committed into RAM. In this setup you would use the updateLog to make the Solr instance crash tolerant.
What you do in point 2 is a hard commit, i.e. flushing the added documents to disk. Doing this after each document add is very expensive. So instead, post a bunch of documents and issue a single hard commit, or have autoCommit set to some reasonable value, like 10 min or 1 hour (depending on your user expectations).
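A minimal SolrJ sketch of the batching approach, written in Scala to match the other examples here; the collection URL, batch size, and field names are placeholders, and newer SolrJ versions use HttpSolrClient where the question uses the older SolrServer API:

import scala.collection.JavaConverters._
import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.common.SolrInputDocument

object BatchIndex {
  def main(args: Array[String]): Unit = {
    // Placeholder URL: point this at your core/collection.
    val solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()

    val docs = (1 to 2000).map { i =>
      val doc = new SolrInputDocument()
      doc.addField("id", i.toString)
      doc.addField("title_s", s"document $i") // placeholder field
      doc
    }

    // Send the documents in batches and issue one hard commit at the end,
    // instead of a hard commit after every single add.
    docs.grouped(500).foreach(batch => solr.add(batch.asJava))
    solr.commit() // a single hard commit (or rely on autoCommit in solrconfig.xml)

    solr.close()
  }
}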
