OpenDJ VLV index error: "Server-side sort failed: Unwilling to Perform"

I'm using the OpenDJ 3.0.0 release.
I have two base DNs: the first is dc=tenant1, the second is dc=tenant2. The VLV index I created is based on dc=tenant1, but the LDAP search happens under dc=tenant2.
Here is the VLV index definition:
filter: (&(objectClass=ns-nationsky-base-subject)(uid=)(cn=))
base dn: dc=tenant1
sort order: uid cn mail
scope: one level
I get "# Server-side sort failed: Unwilling to Perform" when I try to run ldapsearch with a VLV control, like below:
/ldapsearch -p 1389 -h localhost -D 'cn=Directory Manager' -w 'password' -b 'ou=People,ou=Subjects,dc=tenant2' -G 0:2000:1:0 -s one --sortorder uid "(uid=a)" cn
It all works fine, but there is always a "# Server-side sort failed: Unwilling to Perform" error when there are too many entries on my server (say 15,000).
In the access log, I can see the search is unindexed:
[19/Sep/2016:23:06:38 +0800] SEARCH REQ conn=35 op=1 msgID=2 base="ou=People,ou=Subjects,dc=tenant2" scope=one filter="(uid=a)" attrs="cn"
[19/Sep/2016:23:06:40 +0800] SEARCH RES conn=35 op=1 msgID=2 result=0 nentries=8458 unindexed etime=2543
Any idea how I can fix it?

A VLV index and VLV queries are really meant for browsing a well-known set of entries (like all users), not varying sets of entries.
So, in order to use a VLV index, the search request must match the base DN, the scope, the filter, and the sort order defined for that index (and the filter should be constant), as sketched below.
If the VLV index was defined with (&(objectClass=ns-nationsky-base-subject)(uid=)(cn=)), then a search with (uid=a) does not match the index and thus cannot use it.
Server-side sorting is a very expensive operation. This is why, when there is no matching index, the server refuses to sort large numbers of entries (governed by index-entry-limit). While it is possible to increase this limit, doing so has very serious implications for the amount of resources used by the server and may seriously impact its performance.
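For illustration, a request of the following shape could use a VLV index, assuming the index were defined with base DN dc=tenant1, one-level scope, the constant filter (objectClass=ns-nationsky-base-subject), and sort order uid. This is a sketch under those assumptions, not a fix for searching under dc=tenant2:
ldapsearch -p 1389 -h localhost -D 'cn=Directory Manager' -w 'password' \
  -b 'dc=tenant1' -s one --sortorder uid -G 0:2000:1:0 \
  '(objectClass=ns-nationsky-base-subject)' cn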

Related

Elasticsearch warning: this request accesses system indices, but in a future major version, direct access to system indices will be prevented

When I send a POST request, I receive this warning:
org.elasticsearch.client.RestClient: request [POST http://localhost:9200/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512]
returned 1 warnings: [299 Elasticsearch-7.14.2-6bc13727ce758c0e943c3c21653b3da82f627f75 "this request accesses system indices: [.apm-agent-configuration, .apm-custom-link, .kibana_7.13.4_001, .kibana_task_manager_7.13.4_001, .tasks], but in a future major version, direct access to system indices will be prevented by default"]
Now, I understand that system indices will be hidden in the future and cannot be accessed. What is the correct usage or command to send so that this warning is not displayed?
Your use of POST http://localhost:9200/_search queries all indices in Elasticsearch, including the system ones, which you probably don't really want to be doing.
You're better off specifying which indices you want to query.
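For example (the index names here are placeholders), restrict the request path to the indices you actually own:
curl -s -XPOST 'http://localhost:9200/my-index-1,my-index-2/_search' \
  -H 'Content-Type: application/json' \
  -d '{"query": {"match_all": {}}}'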

ElasticSearch: Result window is too large

My friend stored 65,000 documents in an Elasticsearch cloud instance and I would like to retrieve all of them (using Python). However, when I run my current script, I get an error saying:
RequestError(400, 'search_phase_execution_exception', 'Result window is too large, from + size must be less than or equal to: [10000] but was [30000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.')
My script:
from elasticsearch import Elasticsearch

es = Elasticsearch(cloud_id=cloud_id, http_auth=(username, password))
docs = es.search(body={"query": {"match_all": {}}, "_source": ["_id"], "size": 65000})
What would be the easiest way to retrieve all those documents and not be limited to 10,000 docs? Thanks.
The limit has been set so that the result set does not overwhelm your nodes. Results occupy memory in the Elasticsearch node, so the bigger the result set, the bigger the memory footprint and the impact on the nodes.
Depending on what you want to do with the retrieved documents:
try the scroll API (as suggested in your error message) if it's a batch job. Be mindful of the lifetime of the scroll context in that case.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-scroll
or use search_after, sketched below:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-search-after
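For instance, a minimal search_after flow (the host and field names are placeholders; search_after needs a deterministic sort, and each subsequent request passes the sort values of the last hit of the previous page):
curl -s -XPOST 'http://localhost:9200/my-index/_search' \
  -H 'Content-Type: application/json' \
  -d '{"size": 1000, "query": {"match_all": {}}, "sort": [{"timestamp": "asc"}, {"uid": "asc"}]}'
Repeat the same request with "search_after": [<sort values of the last hit>] added to the body until a page comes back empty.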
You should use the scroll API and get the results in several calls. The scroll API returns the results at most 10,000 at a time (they remain available to consult for the amount of time you indicate in the call), and you can then paginate through them thanks to a scroll_id, as below.
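A sketch of that flow with curl (host and index name are placeholders):
curl -s -XPOST 'http://localhost:9200/my-index/_search?scroll=1m' \
  -H 'Content-Type: application/json' \
  -d '{"size": 1000, "query": {"match_all": {}}}'
The response includes a _scroll_id; pass it back to fetch the next page:
curl -s -XPOST 'http://localhost:9200/_search/scroll' \
  -H 'Content-Type: application/json' \
  -d '{"scroll": "1m", "scroll_id": "<_scroll_id from the previous response>"}'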
The error message itself mentions how you can solve the issue; look carefully at this part of it:
This limit can be set by changing the [index.max_result_window] index level setting.
Please refer to the update index settings API for how to change that.
So for your setting it would look like this (note that 65000 equals the total number of docs in your index):
PUT /<your-index-name>/_settings
{
  "index" : {
    "max_result_window" : 65000
  }
}

How to get the number of currently open shards in an Elasticsearch cluster?

I can't find where to get the number of currently open shards.
I want to set up monitoring to avoid cases like this:
this cluster currently has [999]/[1000] maximum shards open
I can get the maximum limit, max_shards_per_node:
$ curl -X GET "${ELK_HOST}/_cluster/settings?include_defaults=true&flat_settings=true&pretty" 2>/dev/null | grep cluster.max_shards_per_node
"cluster.max_shards_per_node" : "1000",
But I can't find out how to get the number of currently open shards (999).
A very simple way to get this information is to call the _cat/shards API and count the number of lines using the wc shell command:
curl -s -XGET ${ELK_HOST}/_cat/shards | wc -l
That will yield a single number that represents the number of shards in your cluster.
Another option is to retrieve the shard list in JSON format, pipe the results into jq, and then count whatever you want, e.g. below I'm counting all STARTED shards:
curl -s -XGET "${ELK_HOST}/_cat/shards?format=json" | jq '.[].state' | grep STARTED | wc -l
Yet another option is to query the _cluster/stats API:
curl -s -XGET "${ELK_HOST}/_cluster/stats?filter_path=indices.shards.total"
That will return a JSON response with the shard count:
{
  "indices" : {
    "shards" : {
      "total" : 302
    }
  }
}
To my knowledge there is no API from which ES spits out this single number directly. To be sure of that, let's look at the source code.
The error is thrown from IndicesService.java.
To see how currentOpenShards is computed, we can then go to Metadata.java.
As you can see, the code iterates over the index metadata retrieved from the cluster state, pretty much like running the following command and counting the shards, but only for indices with "state" : "open":
GET _cluster/state?filter_path=metadata.indices.*.settings.index.number_of*,metadata.indices.*.state
From that evidence, we can be pretty sure that the single number you're looking for is nowhere to be found and needs to be computed by one of the methods shown above, or by the sketch below. You're free to open a feature request if needed.
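For illustration, that computation can be scripted with curl and jq (a sketch, assuming jq is installed): for every open index it sums primaries times (replicas + 1), which is the total shard count the limit check uses.
curl -s "${ELK_HOST}/_cluster/state?filter_path=metadata.indices.*.settings.index.number_of*,metadata.indices.*.state" |
  jq '[.metadata.indices[]
       | select(.state == "open")
       | (.settings.index.number_of_shards | tonumber) * (1 + (.settings.index.number_of_replicas | tonumber))]
      | add'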
The problem: it seems that your Elasticsearch cluster is hitting the limit on the number of shards per node.
Solution:
Verify the number of shards per node in your configuration and increase it using the Elasticsearch API.
To get the number of shards, use the _cluster/stats API:
curl -s -XGET 'localhost:9200/_cluster/stats?filter_path=indices.shards.total'
From the elastic docs:
The Cluster Stats API allows to retrieve statistics from a cluster wide perspective. The API returns basic index metrics (shard numbers, store size, memory usage) and information about the current nodes that form the cluster (number, roles, os, jvm versions, memory usage, cpu and installed plugins).
To update the number of shards (increase or decrease), use the _cluster/settings API, for example:
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/_cluster/settings' -d '{ "persistent" : {"cluster.max_shards_per_node" : 5000}}'
From the elastic docs:
With specifications in the request body, this API call can update cluster settings. Updates to settings can be persistent, meaning they apply across restarts, or transient, where they don't survive a full cluster restart.
You can reset persistent or transient settings by assigning a null value. If a transient setting is reset, the first one of these values that is defined is applied: the persistent setting, the setting in the configuration file, the default value.
The order of precedence for cluster settings is: transient cluster settings, persistent cluster settings, settings in the elasticsearch.yml configuration file.
It's best to set all cluster-wide settings with the settings API and use the elasticsearch.yml file only for local configurations. This way you can be sure that the setting is the same on all nodes. If, on the other hand, you define different settings on different nodes by accident using the configuration file, it is very difficult to notice these discrepancies.
Another option is a one-liner over _cat/indices, summing primaries times (replicas + 1) for every open index:
curl -s '127.1:9200/_cat/indices' | awk '{ if ($2 == "open") C+=$5*($6+1)} END {print C}'
This works:
GET /_stats?level=shards&filter_path=_shards.total
Reference:
https://stackoverflow.com/a/38108448/4271117

Array index out of bounds exception while downloading Elasticsearch index

I am trying to download a complete Elasticsearch index using:
curl -o output_filename -m 600 -XGET 'http://ip/index/_search?q=*&size=7000000'
But it gives this error:
{"error":"ArrayIndexOutOfBoundsException[-131072]","status":500}
How can I download the complete index data?
The scroll API is what you're looking for; it supports proper pagination:
Scrolling is not intended for real time user requests, but rather for processing large amounts of data
It's the same /_search endpoint, but it additionally gets passed the ?scroll=<timeout> parameter.
Be sure to understand what a timeout such as scroll=1m means: it keeps your scroll context alive until you request the next batch/page.
Use the scroll_id from the response to request the next batch/page, as in the sketch below.
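Putting it together, a hedged sketch of a download loop (host, index name, page size, and output file are placeholders; jq is assumed to be available):
RESP=$(curl -s -XPOST 'http://ip:9200/index/_search?scroll=1m' \
  -H 'Content-Type: application/json' \
  -d '{"size": 1000, "query": {"match_all": {}}}')
while [ "$(echo "$RESP" | jq '.hits.hits | length')" -gt 0 ]; do
  # Append this page of hits to the output file, one JSON document per line.
  echo "$RESP" | jq -c '.hits.hits[]' >> output_filename
  # Request the next page with the scroll_id returned alongside the previous one.
  SCROLL_ID=$(echo "$RESP" | jq -r '._scroll_id')
  RESP=$(curl -s -XPOST 'http://ip:9200/_search/scroll' \
    -H 'Content-Type: application/json' \
    -d "{\"scroll\": \"1m\", \"scroll_id\": \"$SCROLL_ID\"}")
done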

Solr performance with commitWithin does not make sense

I am running a very simple performance experiment where I post 2000 documents to my application, which in turn persists them to a relational DB and sends them to Solr for indexing (synchronously, in the same request).
I am testing 3 use cases:
No indexing at all - ~45 sec to post 2000 documents
Indexing included - commit after each add. ~8 minutes (!) to post and index 2000 documents
Indexing included - commitWithin 1ms ~55 seconds (!) to post and index 2000 documents
The 3rd result does not make any sense; I would expect the behavior to be similar to the one in point 2. At first I thought that the documents were not really committed, but I could actually see them being added by executing some queries during the experiment (via the Solr web UI).
I am worried that I am missing something very big. Is it possible that committing after each add degrades performance by a factor of 400?!
The code I use for point 2:
SolrInputDocument doc = // get doc
SolrServer solrConnection = // get connection
solrConnection.add(doc);
solrConnection.commit();
Whereas the code for point 3 is:
SolrInputDocument doc = // get doc
SolrServer solrConnection = // get connection
solrConnection.add(doc, 1); // according to the API documentation, there is no need to call an explicit commit after this
According to this wiki:
https://wiki.apache.org/solr/NearRealtimeSearch
commitWithin is a soft commit by default. Soft commits are very efficient in terms of making the added documents immediately searchable. But! They are not on disk yet. That means the documents are committed into RAM. In this setup you would use the updateLog to make the Solr instance crash-tolerant.
What you do in point 2 is a hard commit, i.e. flushing the added documents to disk. Doing this after each document add is very expensive. So instead, post a batch of documents and issue a single hard commit, or set autoCommit to some reasonable value, like 10 min or 1 hour (depending on your users' expectations); see the sketch below.
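For illustration, the batch-then-commit pattern over Solr's HTTP update API (the core name and documents are placeholders; the SolrJ code above can do the same with a single add(Collection) followed by one commit()):
# Send the whole batch in one request and let Solr commit once within 10 s,
# instead of issuing a hard commit per document.
curl 'http://localhost:8983/solr/mycore/update?commitWithin=10000' \
  -H 'Content-Type: application/json' \
  -d '[{"id": "doc1"}, {"id": "doc2"}]'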

Resources