When is a JanusGraph mixed index actually indexed? - janusgraph

I use JanusGraph 0.2.0 with an Elasticsearch backend.
When is it indexed?
Once a vertex or edge is added and the transaction is successfully committed, can I consider the mixed indexes related to that transaction immediately available?
Or are the mixed-index mutations sent to the backend lazily by JanusGraph after the commit (that is, eventually consistent)?
When the mixed-index backend is down
If the mixed-index mutations are sent lazily, can I still successfully commit a transaction that requires mixed indexes while the mixed-index backend is down?
Index status
If the mixed-index mutations are sent to the backend lazily, how can I check the indexing state (index lag)?

JanusGraph commits the index backend's mutations at the time of transaction commit.
There is a relevant configuration option in JanusGraph:
storage.write-time : default value 100000 ms
Maximum time (in ms) to wait for a backend write operation to complete successfully. If a backend write operation fails temporarily, JanusGraph will back off exponentially and retry the operation until the wait time has been exhausted.
The class IndexTransaction wraps the transaction handle of an index and buffers all mutations against that index for efficiency. It retries until storage.write-time is exceeded and then throws a BackendException, so a commit that needs the mixed-index backend fails rather than being deferred if that backend stays down past the retry window.
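Below is a minimal sketch of what this means in practice, assuming a local BerkeleyJE storage backend and an Elasticsearch index backend on localhost; the backend choices, directory, hostnames, property key, and the explicit storage.write-time value are illustrative assumptions, not taken from the question.

```java
import java.time.Duration;

import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class CommitExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical configuration: local BerkeleyJE storage plus an ES
        // mixed-index backend. storage.write-time bounds the retry window
        // for backend writes (default 100000 ms).
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "berkeleyje")
                .set("storage.directory", "/tmp/janusgraph")
                .set("index.search.backend", "elasticsearch")
                .set("index.search.hostname", "127.0.0.1")
                .set("storage.write-time", Duration.ofMillis(100000))
                .open();

        // Assuming the "name" key is part of a mixed index, the vertex mutation
        // and its index mutation are flushed to the storage and index backends
        // as part of commit(); if the index write still fails after the retry
        // window, commit() surfaces a backend exception instead of silently
        // deferring the index update.
        graph.addVertex("name", "alice");
        graph.tx().commit();

        graph.close();
    }
}
```

Keep in mind that even after a successful commit, the document only becomes searchable in Elasticsearch after the next refresh (near real time), so a mixed-index query issued immediately after the commit may briefly miss the new element.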

Related

Elasticsearch multiple index request on same document id in bulk api

We are using Elasticsearch 6.0 and bulk indexing to index many documents in a single request using the “index” action. In a single request we can have a scenario where there are multiple “index” requests on the same document. Will ES fail the bulk request in such a case, or will it process all of them in order?
Edit 1: I use a script for indexing in the bulk request, where we handle out-of-order updates. So as long as all “index” requests get processed, we don’t have any issue.
ES will not fail, but it is not necessarily clear which indexing operation will "win". It might be the last one but since all operations in the bulk batch might be spread over several ingest nodes, and not all of those nodes process the indexing operations at the same rate, it might not be clear which operation will be processed first and which will be processed last.
The only guarantee that you have is that in the response, you'll get the state of each operation in the same order as specified in the request batch.
If your index has only one primary shard, then the order in which you submit the operations will be the same order as the one those operations are processed, hence the last one wins, but if you have more than one primary shard on more than one node, then you can't really know.
A better question would be why do you submit several indexing operations per document knowing in advance that only one will win?
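For illustration, here is a hedged sketch with the Elasticsearch 6.x Java high-level REST client that puts two “index” operations for the same document id into one bulk request; the index name, document id, and field values are placeholders.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class BulkSameIdExample {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Two index operations targeting the same document id in one bulk call.
            BulkRequest bulk = new BulkRequest();
            bulk.add(new IndexRequest("myindex", "_doc", "doc-1")
                    .source("{\"counter\":1}", XContentType.JSON));
            bulk.add(new IndexRequest("myindex", "_doc", "doc-1")
                    .source("{\"counter\":2}", XContentType.JSON));

            // The bulk request does not fail because of the duplicate id; each
            // sub-request gets its own status, reported in submission order.
            BulkResponse response = client.bulk(bulk);
            response.forEach(item ->
                    System.out.println(item.getOpType() + " -> " + item.status()));
        }
    }
}
```

Because both operations target the same document id, they are routed to the same primary shard, where the bulk sub-requests are applied one by one (see also the "Bulk API order of execution Elasticsearch" question below).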

Elasticsearch delete_by_query version conflict

According to the ES documentation, document indexing/deletion happens as follows:
Request received at one of the nodes.
Request forwarded to the document's primary shard.
The operation performed on the primary shard and parallel requests sent to replica nodes.
The primary shard node waits for a response from the replica nodes and then sends the response to the node where the request was originally received.
That node sends the response back to the client.
Now in my case, I am sending a create-document request to ES at time t and then sending a request to delete the same document (using delete_by_query) at approximately t+800 milliseconds. These requests are sent via a messaging system (an internal implementation of Kafka) which ensures that the delete request will be sent to ES only after a 200 OK response for the indexing operation has been received from ES.
According to the ES documentation, delete_by_query throws a 409 version conflict only when the documents present in the delete query have been updated while delete_by_query was still executing.
In my case, it is always guaranteed that the delete_by_query request will be sent to ES only when a 200 OK response has been received for all the documents that have to be deleted. Hence there is no possibility of an update/create of a document that has to be deleted during the delete_by_query operation.
Please let me know if I am missing something or this is an issue with ES.
A possible reason is that when a document is created, it is not "committed" to the index immediately.
Elasticsearch indices operate on a refresh_interval, which defaults to 1 second.
This documentation around refresh cycles is old, but I cannot for the life of me find anything as descriptive in the more modern ES versions.
A few things you can try:
Send _refresh with your request
Add ?refresh=wait_for or ?refresh=true param
Note that refreshing the index on every indexing request is terrible for performance, which raises the question of why you are trying to delete a document immediately after indexing it.
Alternatively, add deleteByQueryRequest.setAbortOnVersionConflict(false); so the delete-by-query proceeds instead of aborting when it hits version conflicts.
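Putting the two suggestions together, here is a hedged sketch using the Elasticsearch Java high-level REST client (6.5 or later, where DeleteByQueryRequest is supported): the document is indexed with the wait_for refresh policy so it is searchable before the delete-by-query runs, and the delete-by-query is told to proceed on version conflicts. Index name, id, and query are placeholders.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.support.WriteRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.reindex.DeleteByQueryRequest;

public class DeleteAfterIndexExample {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Index and block until the next refresh makes the document searchable.
            IndexRequest index = new IndexRequest("myindex", "_doc", "doc-1")
                    .source("{\"user\":\"alice\"}", XContentType.JSON)
                    .setRefreshPolicy(WriteRequest.RefreshPolicy.WAIT_UNTIL);
            client.index(index, RequestOptions.DEFAULT);

            // Delete by query without aborting the whole request on 409 conflicts.
            DeleteByQueryRequest delete = new DeleteByQueryRequest("myindex");
            delete.setQuery(QueryBuilders.termQuery("user", "alice"));
            delete.setAbortOnVersionConflict(false);
            client.deleteByQuery(delete, RequestOptions.DEFAULT);
        }
    }
}
```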

LRU Caching policy of a Node Query cache in Elasticsearch

I have an Elasticsearch cluster set up with the node query cache enabled. I have set the size of the cache to 2 GB, but I am not completely sure how the LRU caching policy works in this case.
I run a query context against the Elasticsearch index and expect the result to be cached, so that when the same query context is requested again the hit_count should increase, but this is not the behaviour I see in ES.
These are the stats of my query_cache:
memory_size_in_bytes: 7176480,
total_count: 36605,
hit_count: 15657,
miss_count: 20948,
cache_size: 130,
cache_count: 130,
evictions: 0
Even though memory_size_in_bytes has not reached its maximum, the result of the query context is not completely cached, and when the same query context is fired against the Elasticsearch index I see the miss_count stat increase rather than the hit_count.
Can anyone please explain how node query caching works in ES?
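These numbers look like the query_cache section of the indices (or nodes) stats API; for reference, here is a hedged sketch with the Elasticsearch low-level REST client (6.4+) for pulling them, with the host and endpoint as assumptions.

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class QueryCacheStatsExample {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            // Index-level query cache stats: memory_size_in_bytes, hit_count,
            // miss_count, cache_size, cache_count, evictions.
            Response response = client.performRequest(
                    new Request("GET", "/_stats/query_cache"));
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```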

Bulk API order of execution Elasticsearch

If I have requests 1, 2, 3 in the Elasticsearch bulk API, am I guaranteed that they are executed sequentially, i.e. 1 first, then 2, and then 3?
This article says that
Each subrequest is executed independently, so the failure of one subrequest won’t affect the success of the others.
This implies that you should not count on the order of the requests, because some of them might not finish successfully at all.
However, the response contains the status for each subrequest in the same order as they were submitted.
Also note that the index is refreshed only once per second (by default), so I would expect that individual subrequests would not see the changes of other operations from the same batch.
After reading the source code, we've found that for operations on the same doc id the order can be assured, because the Elasticsearch server first sorts the bulk request and groups the operations by shard. The per-shard requests are then distributed to those shards, and once a shard receives its shard bulk request it executes the operations one by one.

Issues with ElasticSearch for real-time geo queries

I'm building a service that will allow users to search for other users who are nearby, based on GPS coordinates. I've tried using ElasticSearch's geo spatial indexes. When a user signs in, he submits his GPS location to an ElasticSearch geo index. Other users periodically poll ElasticSearch, querying for new documents that contain GPS coordinates within a few hundred meters.
The problem is that ElasticSearch either doesn't update its index fast enough, or it caches its results, making it unsuitable for retrieving real-time results. I've tried disabling the cache with index.cache.filter.max_size=-1 and passing "_cache=false" with every query. ElasticSearch still returns stale results when polling with the same query, and it can return stale results for up to a few minutes.
Any idea on what could be happening? Maybe it's because I'm keeping the same connection open during polling, and ElasticSearch caches results for each connection? Still, the results can be out of date with subsequent requests.
Documents indexed into Elasticsearch don't become immediately available for search. They are accumulated in a buffer and become searchable only after an operation called refresh. In other words, search is not a real-time but a "near real time" operation ("near" because refresh is called every second by default). Please also note that the get operation is real-time: you can get a document immediately after it is indexed.
While you can force the refresh process after each document or make it run more often, it's not the best solution for your problem because very frequent refreshing can significantly reduce search and indexing performance. Instead, I would advise you to check Elasticsearch percolators, which were added exactly for use cases such as yours.
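For completeness, here is a hedged sketch of the refresh-based approach using a modern Elasticsearch 6.x Java high-level REST client (much newer than the ES version in the question); it assumes an index whose location field is mapped as geo_point, and the index name, id, coordinates, and 500 m radius are placeholders.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.support.WriteRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.DistanceUnit;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class NearbyUsersExample {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Index the user's position; WAIT_UNTIL blocks until the document is
            // visible to search, without forcing a refresh on every request.
            IndexRequest position = new IndexRequest("users", "_doc", "user-42")
                    .source("{\"location\": {\"lat\": 48.8566, \"lon\": 2.3522}}",
                            XContentType.JSON)
                    .setRefreshPolicy(WriteRequest.RefreshPolicy.WAIT_UNTIL);
            client.index(position, RequestOptions.DEFAULT);

            // Find other users within 500 m of the same point.
            SearchRequest search = new SearchRequest("users").source(
                    new SearchSourceBuilder().query(
                            QueryBuilders.geoDistanceQuery("location")
                                    .point(48.8566, 2.3522)
                                    .distance(500, DistanceUnit.METERS)));
            SearchResponse response = client.search(search, RequestOptions.DEFAULT);
            System.out.println("nearby hits: " + response.getHits().getTotalHits());
        }
    }
}
```

As the answer notes, wait_for/forced refresh trades indexing throughput for visibility; for true push-style "notify me when a matching user appears" behaviour, percolators are the better fit.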
