opendj, 3m data, ldapsearch --timelimit 60 never return - opendj

With OpenDJ3.0, about 3 millions entries are saved. the entries I saved actually is tokens which has ttl (aka expiry time).
What I'm doing is try to schedule a cron job to periodically search out those expired tokens, and delete them.
I'm using OpenDJ sdk SimplePagedResultsControl to perform a paged search, pageSize=1000, timelimit=60 seconds, search filter is (token-ttl<=20170724234636.576Z)
The search user I'm using is the default "cn=Directory Manager", with default resource limit settings. BTW, the entry-limit for token-ttl index I set is 20000
But in case of two many tokens matched the filter, the search took forever to return.
I tried the ldapsearch utility, the same result.
Any suggestions ?
Thanks

It looks like the search is unindexed and tries to scan the whole database in search of matching entries. Normally your search should hit some limit and be stopped forcefully (like the cursor read limit: once 100000 records have been read in the db).
You may try to add an ordering index on token-ttl attribute?
Search filters hitting this attribute will be faster, however any operation updating this attribute will be a bit slower. That is the trade-off with indexes.

Related

Why is ElasticSearch index searchable when refresh_interval is set to -1 on initial data upload?

I'm performing a large upload of data to an empty index.
This article suggests to set "refresh_interval=-1" and "number_of_replicas=0" to increase upload performance. Then it says to enable it back.
The interesting thing is that if I don't enable it back - I can still send the queries to the newly created index and get the results.
I'd like to know why is that and what I got wrong ? (My expectation was that I should get zero results because indexing is disabled)
And one more thing I'd like to understand - if I enable refresh_interval back to the original value, do I need to execute /_refresh operation ?
By default, Elasticsearch periodically refreshes indices every second,
but only on indices that have received one search request or more in
the last 30 seconds. You can change this default interval using the
index.refresh_interval setting.
so document says: when you send a search request, it will send a refresh request with that. so you could search your data but very slow for first time or miss some data for first search. it is better to have a refresh_interval if you index new data on your indices.

When no write, why Elasticsearch performs indexing every 'n' seconds?

I have basic question regarding elastic search.
As per documentation : By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds.
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html#refresh-api-desc
Also as per documentation: When a document is stored, it is indexed and fully searchable in near real-time--within 1 second.
Reference : https://www.elastic.co/guide/en/elasticsearch/reference/7.14/documents-indices.html
So when write happens, indexing happen. When write is not happening and documents are already indexed, then why elastic search indexes every 1 second existing documents?
it's not indexing existing documents, that's already been done
it's checking to see if it needs to write any in memory indexing requests that need to be written to disk to make them searchable

How to debug document not available for search in Elasticsearch

I am trying to search and fetch the documents from Elasticsearch but in some cases, I am not getting the updated documents. By updated I mean, we update the documents periodically in Elasticsearch. The documents in ElasticSearch are updated at an interval of 30 seconds, and the number of documents could range from 10-100 Thousand. I am aware that the update is generally a slow process in Elasticsearch.
I am suspecting it is happening because Elasticsearch though accepted the documents but the documents were not available for searching. Hence I have the following questions:
Is there a way to measure the time between indexing and the documents being available for search? There is setting in Elasticsearch which can log more information in Elasticsearch logs?
Is there a setting in Elasticsearch which enables logging whenever the merge operation happens?
Any other suggestion to help in optimizing the performance?
Thanks in advance for your help.
By default the refresh_interval parameter is set to 1 second, so unless you changed this parameter each update will be searchable after maximum 1 second.
If you want to make the results searchable as soon as you have performed the update operation you can use the refresh parameter.
Using refresh=wait_for the endpoint will respond once a refresh has occured. If you use refresh=true a refresh operation will be triggered. Be careful using refresh=true if you have many update since it can impact performances.

ES returns only 10 records by default. How to get all records without using scroll API

When we query ES for records it returns 10 records by default. How can I get all the records in the same query without using any scroll API.
There is an option to specify the size, but size is not known in advance.
You can retrieve up to 10k results in one request (setting "size": 10000). If you have less than 10k matching documents then you can paginate over them using a combination of from/size parameters. If it is more, then you would have to use other methods:
Note that from + size can not be more than the index.max_result_window index setting which defaults to 10,000. See the Scroll or Search After API for more efficient ways to do deep scrolling.
To be able to do pagination over unknown number of documents you will have to get the total count from the first returned query.
Note that if there are concurrent changes in the data, results of paginated retrieval may not be consistent (for example, if one document gets inserted or deleted while you are paginating). Scroll is consistent since it is created from a "snapshot" of the index created at query start time.

Getting an indexes item count with ElasticSearch

I am writing some code where we are inserting 200,000 items into an ElasticSearch index.
Whilst this works fine, when we get a count of items in the index to ascertain everything went in, we are not getting the same number. However, if we wait a second or two, the count is correct.
Therefore, is there a programmatic way we can get a real count from ElasticSearch without having to sleep or similar?
Newly indexed records become visible in search results only after the Refresh operation. Refresh is called automatically with frequency specified by index.refresh_interval setting, which is 1s by default. When writing elasticsearch tests, it's customary to call refresh after indexing to make sure that all indexed records are available in searches. However, excessive refresh calls (after each record, for example) in production code might hamper the elasticsearch indexing performance.

Resources