Elasticsearch warning: this request accesses system indices but in a future major version, direct access to system indices will be prevented - spring-boot

When I send a POST request, I receive this warning:
org.elasticsearch.client.RestClient: request [POST http://localhost:9200/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512]
returned 1 warnings: [299 Elasticsearch-7.14.2-6bc13727ce758c0e943c3c21653b3da82f627f75 "this request accesses system indices: [.apm-agent-configuration, .apm-custom-link, .kibana_7.13.4_001, .kibana_task_manager_7.13.4_001, .tasks], but in a future major version, direct access to system indices will be prevented by default"]
Now, I understand that system indices will be hidden in the future and can no longer be accessed directly. What is the correct usage or command to send so that this warning is not displayed?

Your POST to http://localhost:9200/_search queries all indices in Elasticsearch, including the system indices, which you probably don't really want to be doing.
You're better off specifying which indices you want to query, as in the sketch below.
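For illustration only, a minimal sketch with the Python Elasticsearch client (the question is Spring Boot, so the client choice and the index name "my-index" are assumptions) of targeting a named index instead of the bare /_search endpoint:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Search only the index you actually need ("my-index" is a placeholder)
# rather than /_search across every index, which also touches system indices.
resp = es.search(index="my-index", body={"query": {"match_all": {}}})
print(resp["hits"]["total"])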

Related

Error working with "ScrollElasticSearchHttp" processor in NiFi

I am trying to retrieve data from an index in ElasticSearch. I configured the "QueryElasticSearchHttp" processor and it works just fine. However, when I try to use the ScrollElasticsearchHttp processor with the same URL, query, and index properties, and set 'scroll' to the default 1 minute, it doesn't work.
I get an error response of 404 : "Elasticsearch returned code 404 with message Not found".
I am also tailing the log on the ES cluster and I see this error:
[DEBUG][o.e.a.s.TransportSearchScrollAction] [2] Failed to execute query phase
org.elasticsearch.transport.RemoteTransportException:[127.0.0.1:9300][indices:data/read/search[phase/query+fetch/scroll]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [2]
at org.elasticsearch.search.SearchService.getExecutor(SearchService.java:457) ~[elasticsearch-7.5.2.jar:7.5.2]
I am on Apache NiFi 1.10.0
Here is the config for the processor:
I should see a total of 441 hits, and with page size 20 I should see 23 queries being made to ES.
But I don't get a single result back. I have tried higher values for "scroll" and also played around with "page size" to no avail.
I also noticed that even though the ScrollElasticsearchHttp processor is set to run every 1 minute, I don't see the error repeated every minute in the ES log.
Update:
When I cleared the state via UI: "View state" -> "Clear State", I was able to make a single call, that returned a page full of hits in one flowfile.
However, there are more pages to be retrieved. How do I make the processor fetch the next page?
My understanding was that the single invocation of the ScrollElasticsearchHttp will page through all the result sets and bring in each page as one flowfile. Is this not correct?
Please decrease the scheduling time to around 10-20 seconds, so that every 10-20 seconds the processor fetches the next set of records based on your page size.
While fetching is in progress you can check the processor's state value; you will find a scroll id in it. Once fetching is complete, the state value changes to "finishedQuery" : true.
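For context, here is a rough sketch of the scroll loop the processor effectively performs across runs (an assumption for illustration, using the Python client and a placeholder index name, not the processor's actual implementation):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Open the scroll context with the first search, then keep requesting the
# next page until no hits are returned - roughly one page per processor run.
page = es.search(index="my-index", scroll="1m", size=20, body={"query": {"match_all": {}}})
while page["hits"]["hits"]:
    for hit in page["hits"]["hits"]:
        print(hit["_id"])
    page = es.scroll(scroll_id=page["_scroll_id"], scroll="1m")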

Elasticsearch Delete by Query Version Conflict

I am using Elasticsearch version 5.6.10. I have a query that deletes records for a given agency, so they can later be updated by a nightly script.
The query is in elasticsearch-dsl and looks like this:
def remove_employees_from_search(jurisdiction_slug, year):
    s = EmployeeDocument.search()
    s = s.filter('term', year=year)
    s = s.query('nested', path='jurisdiction', query=Q("term", **{'jurisdiction.slug': jurisdiction_slug}))
    response = s.delete()
    return response
The problem is I am getting a ConflictError exception when trying to delete the records via that function. I have read this occurs because the documents were different between the time the delete process started and executed. But I don't know how this can be, because nothing else is modifying the records during the delete process.
I am going to add s = s.params(conflicts='proceed') in order to silence the exception. But this is a band-aid as I do not understand why the delete is not processing as expected. Any ideas on how to troubleshoot this? A snapshot of the error is below:
ConflictError: TransportError(409,
u'{
  "took": 10,
  "timed_out": false,
  "total": 55,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 55,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1.0,
  "throttled_until_millis": 0,
  "failures": [
    {
      "index": "employees",
      "type": "employee_document",
      "id": "24681043",
      "cause": {
        "type": "version_conflict_engine_exception",
        "reason": "[employee_document][24681043]: version conflict, current version [5] is different than the one provided [4]",
        "index_uuid": "G1QPF-wcRUOCLhubdSpqYQ",
        "shard": "0",
        "index": "employees"
      },
      "status": 409
    },
    {
      "index": "employees",
      "type": "employee_document",
      "id": "24681063",
      "cause": {
        "type": "version_conflict_engine_exception",
        "reason": "[employee_document][24681063]: version conflict, current version [5] is different than the one provided [4]",
        "index_uuid": "G1QPF-wcRUOCLhubdSpqYQ",
        "shard": "0",
        "index": "employees"
      },
      "status": 409
    }
You could try making it do a refresh first:
client.indices.refresh(index='your-index')
Source: https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference.html#_indices_refresh
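For the asker's Python setup, a minimal sketch of the same refresh call with the low-level client (the host is a placeholder; "employees" is the index name taken from the error output):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Force a refresh so pending changes are visible before the delete-by-query runs.
es.indices.refresh(index="employees")

Combined with the s.params(conflicts='proceed') that the asker already mentions, the delete-by-query will skip conflicting documents instead of aborting with a 409.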
First, this is a question that was asked 2 years ago, so take my response with a grain of salt due to the time gap.
I am using the JavaScript API, but I would bet that the flags are similar. When you index or delete there is a refresh flag which lets you force the operation to become visible to search.
I am not an Elasticsearch guru, but the engine must perform some systematic maintenance on the indices and shards to move them to a stable state. That probably happens over time, so you would not necessarily get an immediate state update. Furthermore, from personal experience, I have seen cases where a delete does not seem to remove the item from the index right away: it may be marked as "deleted" and given a new version number, but it seems to stick around (probably until general maintenance sweeps run).
Here is the JS API for delete; it is the same for index and some of the other calls.
client.delete({
  id: string,
  index: string,
  type: string,
  wait_for_active_shards: string,
  refresh: 'true' | 'false' | 'wait_for',
  routing: string,
  timeout: string,
  if_seq_no: number,
  if_primary_term: number,
  version: number,
  version_type: 'internal' | 'external' | 'external_gte' | 'force'
})
https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference.html#_delete
refresh
'true' | 'false' | 'wait_for' - If true then refresh the affected shards to make this operation visible to search, if wait_for then wait for a refresh to make this operation visible to search, if false (the default) then do nothing with refreshes.
For additional reference, here is the page on Elasticsearch refresh info and what might be a fairly relevant blurb for you.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html
Use the refresh API to explicitly refresh one or more indices. If the request targets a data stream, it refreshes the stream’s backing indices. A refresh makes all operations performed on an index since the last refresh available for search.
By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds. You can change this default interval using the index.refresh_interval setting.

409 error when using streaming_bulk() - certain that document is only included once.

I am attempting to upload a large number of documents - about 7 million.
I have created actions for each document to be added and split them up into about 260 files, about 30K documents each.
Here is the format of the actions:
a = someDocument  # a document with nested fields
esActionFromFile = [{
    '_index': 'mt-interval-test-9',
    '_type': 'doc',
    '_id': 5641254,
    '_source': a,
    '_op_type': 'create'
}]
I have tried using helpers.bulk, helpers.parallel_bulk, and helpers.streaming_bulk and have had partial success using helpers.bulk and helpers.streaming_bulk.
Each time I run a test, I delete, and then recreate the index using:
# Delete and recreate the index
es.indices.delete(index=index, ignore=[400, 404])
es.indices.create(index=index, body=mappings_request_body)
When I am partially successful - many documents are loaded, but eventually I get a 409 version conflict error.
I am aware that there can be version conflicts created when there has not been sufficient time for ES to process the deletion of individual documents after doing a delete by query.
At first, I thought that something similar was happening here. However, I realized that I am often getting the errors from files the first time they have ever been processed (i.e. even if the deletion was causing issues, this particular file had never been loaded, so there couldn't be a conflict).
The _id value I am using is the primary key from the original database I am extracting the data from, so I am certain the ids are unique. Furthermore, I have checked whether there was unintentional duplication of records in my actions arrays, or in the files I created them from, and there are no duplicates.
I am at a loss to explain why this is happening, and struggling to find a solution to upload my data.
Any assistance would be greatly appreciated!
There should be information attached to the 409 response that should tell you exactly what's going wrong and which document caused it.
Another thing that could cause this is a retry - when elasticsearch-py cannot connect to the cluster it will resend the request to a different node. In some complex scenarios a request can thus end up being sent twice. This is especially true if you enabled the retry_on_timeout option.
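As an illustrative sketch of how to surface the per-document error details instead of failing the whole run (the action values follow the question; raise_on_error=False is an elasticsearch-py helpers option):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch("http://localhost:9200", retry_on_timeout=False)

# Bulk actions built as in the question (placeholder _source shown here).
esActionFromFile = [
    {"_index": "mt-interval-test-9", "_type": "doc", "_id": 5641254,
     "_source": {"field": "value"}, "_op_type": "create"},
]

for ok, item in streaming_bulk(es, esActionFromFile, raise_on_error=False):
    if not ok:
        # Each failed item carries the full error body, including the
        # version_conflict reason and the offending document _id.
        print(item)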

Array index out of bound exception while downloading elastic search index

I am trying to download a complete Elasticsearch index using:
curl -o output_filename -m 600 -GET 'http://ip/index/_search?q=*&size=7000000'.
But it's giving this error:
{"error":"ArrayIndexOutOfBoundsException[-131072]","status":500}
How can I download the complete index data?
The scroll API is what you're looking for, which supports proper pagination:
Scrolling is not intended for real time user requests, but rather for processing large amounts of data
It's the same /_search endpoint, but it additionally gets passed the ?scroll=<timeout> parameter.
Be sure to understand what the timeout, e.g. scroll=1m, means: it keeps your scroll context alive until you request the next batch/page.
Use the scroll_id from the response to request the next batch/page.
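A hedged sketch with the Python client's scan helper (an assumption, since the question uses curl; the host and index name are placeholders from the question), which drives the scroll API and pages through the whole index:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://ip:9200")

# scan() handles the scroll_id bookkeeping for you, fetching size-document
# pages until the whole index has been read.
for hit in scan(es, index="index", query={"query": {"match_all": {}}}, scroll="1m", size=1000):
    print(hit["_source"])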

Solr performance with commitWithin does not make sense

I am running a very simple performance experiment where I post 2000 documents to my application, which in turn persists them to a relational DB and sends them to Solr for indexing (synchronously, in the same request).
I am testing 3 use cases:
No indexing at all - ~45 sec to post 2000 documents
Indexing included - commit after each add. ~8 minutes (!) to post and index 2000 documents
Indexing included - commitWithin 1ms ~55 seconds (!) to post and index 2000 documents
The 3rd result does not make any sense to me; I would expect behavior similar to point 2. At first I thought that the documents were not really committed, but I could actually see them being added by executing some queries during the experiment (via the Solr web UI).
I am worried that I am missing something very big. Is it possible that committing after each add will degrade performance by a factor of 400?!
The code I use for point 2:
SolrInputDocument doc = // get doc
SolrServer solrConnection = // get connection
solrConnection.add(doc);
solrConnection.commit();
Where as the code for point 3:
SolrInputDocument doc = // get doc
SolrServer solrConnection = // get connection
solrConnection.add(doc, 1); // According to API documentation I understand there is no need to call an explicit commit after this
According to this wiki:
https://wiki.apache.org/solr/NearRealtimeSearch
the commitWithin is a soft commit by default. Soft commits are very efficient at making the added documents immediately searchable. But they are not on disk yet; the documents are committed into RAM. In this setup you would use the updateLog to make the Solr instance crash-tolerant.
What you do in point 2 is a hard commit, i.e. flushing the added documents to disk. Doing this after each document add is very expensive. So instead, post a bunch of documents and issue a single hard commit, or set autoCommit to some reasonable value, like 10 min or 1 hour (depending on your users' expectations).
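As an illustrative sketch of that batching approach (using the Python pysolr client rather than the SolrJ code from the question; the core URL and document fields are placeholders):

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/mycore", timeout=10)

docs = [{"id": str(i), "title": "doc %d" % i} for i in range(2000)]

# Send all documents without committing each one...
solr.add(docs, commit=False)
# ...then issue a single hard commit at the end (or rely on autoCommit instead).
solr.commit()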
