Posting an update request to ElasticSearch without waiting for completion - elasticsearch

I have an ElasticSearch index that stores files, sometimes very large ones. Because the underlying Lucene engine actually does a complete replacement each time a document is updated, the entire document needs to be reindexed behind the scenes even if I am just modifying the value of one field.
For large, multi-MB files this can take a fairly long time (several hundred ms). Since this is done as part of a web application this is not really acceptable. What I am doing right now is forking the process, so the update is called on a separate thread while the request finishes.
This works, but I'm not really happy with this as a long term solution, partially because it means that every time I create a new interface to the search engine I'll have to recode the forking logic. Also it means I basically can't know whether the request is successful or not, or if some kind of error occurred, without writing additional code to log successful or unsuccessful requests somewhere.
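For illustration, if the Java high-level REST client were in use (the question doesn't say which client it is), the same fire-and-forget pattern can be expressed with the async variant of the update call, which also reports success or failure through a listener instead of requiring hand-rolled logging; a minimal sketch:
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.action.update.UpdateResponse;
import org.elasticsearch.client.RequestOptions;

// client is an already-built RestHighLevelClient; index, type, id and field are placeholders
UpdateRequest update = new UpdateRequest("files", "_doc", fileId)
        .doc("some_field", "new value");
client.updateAsync(update, RequestOptions.DEFAULT, new ActionListener<UpdateResponse>() {
    @Override
    public void onResponse(UpdateResponse response) {
        System.out.println("update succeeded: " + response.getResult());  // or log it
    }
    @Override
    public void onFailure(Exception e) {
        e.printStackTrace();  // the web request has already returned by this point
    }
});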
So I'm wondering if there is a lesser-known feature where you can post an UPDATE request to ElasticSearch and have it return an acknowledgement without waiting for the update task to actually complete.
If you look at the documentation for Snapshot and Restore you'll see that when you make a request you can add wait_for_completion=true in order to have the entire process run before the result is returned.
What I want is the reverse — the ability to add ?wait_for_completion=false to a POST request.

Related

ElasticSearch document refresh=true does not appear to work

In order to speed up searches on our website, I have created a small Elasticsearch instance which keeps a copy of all of the "searchable" fields from our database. It holds only a couple million documents with an average size of about 1KB per document. Currently (in development) we have just 2 nodes, but will probably want more in production.
Our application is a "primarily read" application - maybe 1000 documents/day get updated, but they get read and searched tens of thousands of times/day.
Each document represents a case in a ticketing system, and the case may change status during the day as users research and close cases. If a researcher closes a case and then immediately refreshes their queue of open work, we expect the case to disappear from the queue, which is driven by a query to our Elasticsearch instance, filtering by status. The status is a field in the case index.
The complaint we're getting is that when a researcher closes a case, upon immediate refresh of his queue, the case still comes back when filtering on "in progress" cases. If he refreshes the view a second or two later, it's gone.
In an effort to work around this, I added refresh=true when updating the document, e.g.
curl -XPUT 'https://my-dev-es-instance.com/cases/_doc/11?refresh=true' -H 'Content-Type: application/json' -d '{"status":"closed", ... }'
But still the problem persists.
Here's the response I got from the above request:
{"_index":"cases","_type":"_doc","_id":"11","_version":2,"result":"updated","forced_refresh":true,"_shards":{"total":2,"successful":1,"failed":0},"_seq_no":70757,"_primary_term":1}
The response seems to verify that the forced_refresh request was received, although it does say out of total 2 shards, 1 was successful and 0 failed. Not sure about the other one, but since I have only 2 nodes, does this mean it updated the secondary?
According to the doc:
To refresh the shard (not the whole index) immediately after the operation occurs, so that the document appears in search results immediately, the refresh parameter can be set to true. Setting this option to true should ONLY be done after careful thought and verification that it does not lead to poor performance, both from an indexing and a search standpoint. Note, getting a document using the get API is completely realtime and doesn’t require a refresh.
Are my expectations reasonable? Is there a better way to do this?
After more testing, I have concluded that my issue was due to an application logic error, and not a problem with ElasticSearch. The refresh flag is behaving as expected. Apologies for the misinformation.

Elasticsearch high level REST client - Indexing has latency

We have finally started using the high-level REST client to ease the development of queries from a backend engineering perspective. For indexing, we are using client.update(request, RequestOptions.DEFAULT) so that new documents are created and existing ones modified.
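For context, such an upsert-style call looks roughly like this (a sketch; the index name, id, field values and the already-built client are placeholders):
import java.util.HashMap;
import java.util.Map;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.RequestOptions;

// Partial update that creates the document when it doesn't exist yet (upsert).
Map<String, Object> jsonMap = new HashMap<>();
jsonMap.put("status", "open");
UpdateRequest request = new UpdateRequest("my-index", "_doc", id)
        .doc(jsonMap)
        .docAsUpsert(true);
client.update(request, RequestOptions.DEFAULT);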
The issue we are seeing is that indexing is delayed by almost 5 minutes. I see that the client uses async HTTP calls internally, but that should not take so long. I looked for timing options inside the library and didn't find anything. Am I missing something, or is this missing from the official documentation?
Since you have refresh_interval: -1 in your index settings, the index is never refreshed unless you do it manually, which is why you don't see the data just after it's been updated.
You have three options here:
A. You can call the _update endpoint with the refresh=true (or refresh=wait_for) parameter to make sure that the index is refreshed just after your update.
B. You can simply set refresh_interval: 1s (or any other duration that makes sense for you) in your index settings, to make sure the index is automatically refreshed on a regular basis.
C. You can explicitly call index/_refresh on your index to refresh it whenever you think is appropriate.
Option B is the one that usually makes sense in most use cases.
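Since the question uses the high-level REST client, option B could be applied from code roughly like this (a sketch; the index name and the already-built client are placeholders):
import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.common.settings.Settings;

// Re-enable periodic refreshes so updates become searchable within about a second.
UpdateSettingsRequest settingsRequest = new UpdateSettingsRequest("my-index")
        .settings(Settings.builder().put("index.refresh_interval", "1s"));
client.indices().putSettings(settingsRequest, RequestOptions.DEFAULT);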
There are several references on using refresh=wait_for, but I had a hard time finding exactly what needed to be done in the REST high-level client.
For all of you that are searching this answer:
IndexRequest request = new IndexRequest(index, DOC_TYPE, id);    // the document being indexed
request.setRefreshPolicy(WriteRequest.RefreshPolicy.WAIT_UNTIL); // block the response until the change is visible to search

GraphQL Asynchronous query results

I'm trying to implement a batch query interface with GraphQL. I can get a request to work synchronously without issue, but I'm not sure how to approach making the result asynchronous. Basically, I want to be able to kick off the query and return a pointer of sorts to where the results will eventually be when the query is done. I'd like to do this because the queries can sometimes take quite a while.
In REST, this is trivial. You return a 202 with a Location header pointing to where the client can go to fetch the result. GraphQL as a specification does not seem to have this notion; it appears to always want requests to be handled synchronously.
Is there any convention for doing things like this in GraphQL? I very much like the query specification but I'd prefer to not leave the client HTTP connection open for up to a few minutes while a large query is executed on the backend. If anything happens to kill that connection the entire query would need to be retried, even if the results themselves are durable.
What you're trying to do is not solved easily in a spec-compliant way. Apollo introduced the idea of a @defer directive that does pretty much what you're looking for, but it's still an experimental feature. I believe Relay Modern is trying to do something similar.
The idea is effectively the same -- the client uses a directive to mark a field or fragment as deferrable. The server resolves the request but leaves the deferred field null. It then sends one or more patches to the client with the deferred data. The client is able to apply the initial request and the patches separately to its cache, triggering the appropriate UI changes each time as usual.
I was working on a similar issue recently. My use case was to submit a job to create a report and provide the result back to the user. Creating a report takes a couple of minutes, which makes it an asynchronous operation. I created a mutation which submitted the job to the backend processing system and returned a job ID. Then I periodically poll the jobs field using a query to find out about the state of the job and, eventually, the results. As the result is a file, I return a link to a different endpoint where it can be downloaded (a similar approach to the one GitHub uses).
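Sketched as a schema, that pattern could look roughly like this (all names here are made up for illustration; the original schema isn't shown):
input ReportInput {
  name: String!
}

type Mutation {
  submitReport(input: ReportInput!): Job!   # kicks off the long-running job and returns immediately
}

type Query {
  job(id: ID!): Job                         # polled by the client until the job completes
}

type Job {
  id: ID!
  status: JobStatus!
  downloadUrl: String                       # set once the report file is ready
}

enum JobStatus { PENDING RUNNING COMPLETED FAILED }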
Polling for actual results is working as expected but I guess this might be better solved by subscriptions.

Check if S3 file has been modified

How can I use a shell script to check whether an Amazon S3 file (a small .xml file) has been modified? I'm currently using curl to check every 10 seconds, but it's making many GET requests.
curl "s3.aws.amazon.com/bucket/file.xml"
if cmp "file.xml" "current.xml"
then
echo "no change"
else
echo "file changed"
cp "file.xml" "current.xml"
fi
sleep(10s)
Is there a better way to check every 10 seconds that reduces the number of GET requests? (This is built on top of a Rails app, so I could possibly build a handler in Rails?)
Let me start by telling you some facts about S3. You might know this, but in case you don't, you might see that your current code could have some "unexpected" behavior.
S3 and "Eventual Consistency"
S3 provides "eventual consistency" for overwritten objects. From the S3 FAQ, you have:
Q: What data consistency model does Amazon S3 employ?
Amazon S3 buckets in all Regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES.
Eventual consistency for overwrites means that, whenever an object is updated (ie, whenever your small XML file is overwritten), clients retrieving the file MAY see the new version, or they MAY see the old version. For how long? For an unspecified amount of time. It typically achieves consistency in much less than 10 seconds, but you have to assume that it will, eventually, take more than 10 seconds to achieve consistency. More interestingly (sadly?), even after a successful retrieval of the new version, clients MAY still receive the older version later.
One thing that you can be assured of is: if a client starts downloading a version of the file, it will download that entire version (in other words, there's no chance that you would receive, for example, the first half of the XML file as the old version and the second half as the new version).
With that in mind, notice that your script could fail to identify the change within your 10-second timeframe: you could make multiple requests, even after a change, until your script downloads a changed version. And even then, after you detect the change, it is (unfortunately) entirely possible that the next request would download the previous (!) version and trigger yet another "change" in your code, and then the next would give the current version and trigger yet another "change" in your code!
If you are OK with the fact that S3 provides eventual consistency, there's a way you could possibly improve your system.
Idea 1: S3 event notifications + SNS
You mentioned that you thought about using SNS. That could definitely be an interesting approach: you could enable S3 event notifications and then get a notification through SNS whenever the file is updated.
How do you get the notification? You would need to create a subscription, and here you have a few options.
Idea 1.1: S3 event notifications + SNS + a "web app"
If you have a "web application", ie, anything running in a publicly accessible HTTP endpoint, you could create an HTTP subscriber, so SNS will call your server with the notification whenever it happens. This might or might not be possible or desirable in your scenario
Idea 2: S3 event notifications + SQS
You could create a message queue in SQS and have S3 deliver the notifications directly to the queue. This would also be possible as S3 event notifications + SNS + SQS, since you can add a queue as a subscriber to an SNS topic (the advantage being that, in case you need to add functionality later, you could add more queues and subscribe them to the same topic, therefore getting "multiple copies" of the notification).
To retrieve the notification you'd make a call to SQS. You'd still have to poll - ie, have a loop and call GET on SQS (which costs about the same as S3 GETs, or maybe a tiny bit more depending on the region). The slight difference is that you could reduce the total number of requests a bit -- SQS supports long-polling requests of up to 20 seconds: you make the GET call on SQS and, if there are no messages, SQS holds the request for up to 20 seconds, returning immediately if a message arrives, or returning an empty response if no messages are available within those 20 seconds. So you would send only 1 GET every 20 seconds, and still get faster notifications than you currently have. You could potentially halve the number of GETs you make (once every 10s to S3 vs once every 20s to SQS).
Also - you could choose to use one single SQS queue to aggregate all changes to all XML files, or multiple SQS queues, one per XML file. With a single queue, you would greatly reduce the overall number of GET requests. With one queue per XML file, that's when you could potentially "halve" the number of GET requests as compared to what you have now.
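For illustration, such a long-polling loop could look like this with the AWS SDK for Java (a sketch; the queue URL is a placeholder, and a similar loop is possible from a shell script via the aws CLI):
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/file-changes"; // placeholder

while (true) {
    // Long poll: SQS holds the request for up to 20s and returns as soon as a message arrives.
    ReceiveMessageRequest receive = new ReceiveMessageRequest(queueUrl)
            .withWaitTimeSeconds(20)
            .withMaxNumberOfMessages(1);
    for (Message message : sqs.receiveMessage(receive).getMessages()) {
        System.out.println("S3 reported a change: " + message.getBody());
        // ... re-download file.xml here ...
        sqs.deleteMessage(queueUrl, message.getReceiptHandle());
    }
}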
Idea 3: S3 event notifications + AWS Lambda
You can also use a Lambda function for this. This could require some more changes in your environment - you wouldn't use a shell script to poll, but S3 can be configured to call a Lambda function for you as a response to an event, such as an update to your XML file. You could write your code in Java, JavaScript or Python (some people have devised "hacks" to use other languages as well, including Bash).
The beauty of this is that there's no more polling, and you don't have to maintain a web server (as in "idea 1.1"). Your code "simply runs", whenever there's a change.
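A minimal handler for that event could look like this (a sketch in Java; the class name and what you do with the notification are made up):
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import com.amazonaws.services.s3.event.S3EventNotification.S3EventNotificationRecord;

public class XmlChangedHandler implements RequestHandler<S3Event, Void> {
    @Override
    public Void handleRequest(S3Event event, Context context) {
        for (S3EventNotificationRecord record : event.getRecords()) {
            String key = record.getS3().getObject().getKey();
            context.getLogger().log("object changed: " + key);
            // ... react here: call your application, refresh a cache, etc. ...
        }
        return null;
    }
}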
Notice that, no matter which one of these ideas you use, you still have to deal with eventual consistency. In other words, you'd know that a PUT/POST has happened, but once your code sends a GET, you could still receive the older version...
Idea 4: Use DynamoDB instead
If you have the ability to make a more structural change on the system, you could consider using DynamoDB for this task.
The reason I suggest this is because DynamoDB supports strong consistency, even for updates. Notice that it's not the default - by default, DynamoDB operates in eventual consistency mode, but the "retrieval" operations (GetItem, for example), support fully consistent reads.
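For illustration, with the AWS SDK for Java a strongly consistent read is just a flag on the GetItem request (a sketch; the table name and key are made up):
import java.util.Collections;
import java.util.Map;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.GetItemRequest;

AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();
GetItemRequest get = new GetItemRequest()
        .withTableName("xml-documents")
        .withKey(Collections.singletonMap("id", new AttributeValue("file.xml")))
        .withConsistentRead(true);   // guarantees the latest committed value is returned
Map<String, AttributeValue> item = dynamo.getItem(get).getItem();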
Also, DynamoDB has what we call "DynamoDB Streams", which is a mechanism that allows you to get a stream of changes made to any (or all) items on your table. These notifications can be polled, or they can even be used in conjunction with a Lambda function, that would be called automatically whenever a change happens! This, plus the fact that DynamoDB can be used with strong consistency, could possibly help you solve your problem.
In DynamoDB, it's usually a good practice to keep the records small. You mentioned in your comments that your XML files are about 2kB - I'd say that could be considered "small enough" so that it would be a good fit for DynamoDB! (the reasoning: DynamoDB reads are typically calculated as multiples of 4kB; so to fully read 1 of your XML files, you'd consume just 1 read; also, depending on how you do it, for example using a Query operation instead of a GetItem operation, you could possibly be able to read 2 XML files from DynamoDB consuming just 1 read operation).
Some references:
http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
http://docs.aws.amazon.com/lambda/latest/dg/with-ddb.html
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_ReceiveMessage.html
I can think of another way by using S3 Versioning; this would require the least amount of changes to your code.
Versioning is a means of keeping multiple variants of an object in the same bucket.
This would mean that every time a new file.xml is uploaded, S3 will create a new version.
In your script, instead of getting the object and comparing it, get the HEAD of the object which contains the VersionId field. Match this version with the previous version to find out if the file has changed.
If the file has indeed changed, get the new file, and save its new version ID locally so that next time you can use it to check whether an even newer version has been uploaded.
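In code, that check is just a HEAD request for the metadata (a sketch with the AWS SDK for Java; the bucket, key and lastSeenVersion variable are placeholders):
import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
// getObjectMetadata issues a HEAD request: headers only, no object body is transferred.
String currentVersion = s3.getObjectMetadata("my-bucket", "file.xml").getVersionId();
if (!currentVersion.equals(lastSeenVersion)) {
    s3.getObject(new GetObjectRequest("my-bucket", "file.xml"), new File("file.xml"));
    lastSeenVersion = currentVersion;   // remember the version we just downloaded
}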
Note 1: You will still be making lots of calls to S3, but instead of fetching the entire file every time, you are only fetching the metadata of the file, which is much faster and smaller in size.
Note 2: However, if your aim is to reduce the number of calls, the easiest solution I can think of is using Lambdas. You can trigger a Lambda function every time a file is uploaded, which then calls the REST endpoint of your service to notify you of the file change.
You can use the --exact-timestamps option of aws s3 sync; see the AWS CLI reference:
https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
Instead of using versioning, you can simply compare the ETag of the file, which is available in the response headers and is similar to the MD5 hash of the file (it is exactly the MD5 hash if the file is small, i.e. less than 4 MB, or sometimes even larger; otherwise, it is the MD5 hash of a list of binary hashes of blocks).
With that said, I would suggest you look at your application again and ask if there is a way you can avoid this critical path.

mongodb many inserts/updates performance

I am using MongoDB to store users' events; there's a document for every user, containing an array of events. The system processes thousands of events a minute and inserts each one of them into Mongo.
The problem is that I get poor performance for the update operation. Using a profiler, I noticed that WriteResult.getError is what incurs the performance impact.
That makes sense: the update is async, but if one wants to retrieve the operation result, one needs to wait until the operation is completed.
My question: is there a way to keep the update async, but only get an exception if an error occurs? (99.999% of the time there is no error, so the system waits for nothing.) I understand it means the exception will be raised somewhere further down the process flow, but I can live with that.
Any other suggestions?
The application is written in Java so we're using the Java driver, but I am not sure it's related.
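As an aside, the knob that controls how much of the result the driver waits for is the write concern; a fully fire-and-forget update with a current Java driver would look roughly like the sketch below. Note this is only an illustration of the trade-off described above, and it goes a bit further than the question asks, since an unacknowledged write reports no errors at all:
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.push;

MongoCollection<Document> events = MongoClients.create("mongodb://localhost")
        .getDatabase("app")
        .getCollection("user_events")
        .withWriteConcern(WriteConcern.UNACKNOWLEDGED);   // driver does not wait for a server reply

// Appends the event without blocking on the result; per-write errors are not reported.
events.updateOne(eq("userId", userId), push("events", new Document("type", "click")));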
Have you created indexes on your records? That may be the cause of your performance problem. If you haven't done so before, you should create an index on your collection, like:
db.collectionName.ensureIndex({"event.type":1})
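Since the application is written in Java, the equivalent with the Java driver is roughly (a sketch; the db variable and collection name are placeholders):
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

// Index the nested event type field so queries filtering on it can use the index.
MongoCollection<Document> collection = db.getCollection("user_events");
collection.createIndex(Indexes.ascending("event.type"));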
For more help, visit http://www.mongodb.org/display/DOCS/Indexes

Resources