sync/async insert or update ElasticSearch in Python - elasticsearch-py

I'm using the Elasticsearch bulk API from Python. Does it provide both a sync and an async API?

If by sync you mean a blocking operation
In Python, the bulk functions are synchronous. The easiest way to go is through the helper
elasticsearch.helpers.bulk(client, actions, stats_only=False, **kwargs)
It returns a tuple with summary information, so the call is synchronous: it only returns once the operation has finished.
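A minimal sketch of calling the helper, assuming documents are plain dicts that carry their own "id" field. The names build_actions and index_all are made up for illustration, and older clusters may also expect a "_type" key in each action.

```python
from typing import Dict, Iterable, Iterator


def build_actions(index: str, docs: Iterable[Dict]) -> Iterator[Dict]:
    """Yield one bulk action per document, in the dict format helpers.bulk() accepts."""
    for doc in docs:
        yield {
            "_op_type": "index",  # inserts the document, or overwrites it if the _id exists
            "_index": index,
            "_id": doc["id"],     # assumption: every document has an "id" field
            "_source": doc,
        }


def index_all(client, index: str, docs: Iterable[Dict]):
    """Blocking bulk write: returns only after Elasticsearch has answered."""
    # Imported here so build_actions() stays usable without the package installed.
    from elasticsearch import helpers

    # With stats_only=False the helper returns (number of successes, list of errors).
    success_count, errors = helpers.bulk(client, build_actions(index, docs))
    return success_count, errors
```

Usage would look like index_all(Elasticsearch(), "articles", docs); the caller is blocked until the tuple comes back.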
If by sync you mean consistency
From the bulk api:
When making bulk calls, you can require a minimum number of active shards in the partition through the consistency parameter
In Python, the bulk function has a consistency parameter that lets you specify how many shards must have acknowledged the change before the method returns. (Elasticsearch 5.0 and later replaced this parameter with wait_for_active_shards.)
If by timeout you mean a way to stop the operation after a while
If you need to limit the duration of a bulk operation, again the low level bulk() function is your friend. It takes a timeout parameter to add an explicit operation timeout.
Even more generally,
Global timeout can be set when constructing the client (see Connection's timeout parameter) or on a per-request basis using request_timeout (float value in seconds) as part of any API call.
For example:
from elasticsearch import Elasticsearch
es = Elasticsearch()
# only wait for 1 second, regardless of the client's default
es.cluster.health(wait_for_status='yellow', request_timeout=1)
As a side note, I searched for the bulk() call in Java, and especially for bulk().await(), and couldn't find anything. May I ask for your source?

Related

API waiting for a specific record on DynamoDB without polling

I am inheriting a workflow that has a reasonable amount of data stored in DynamoDB. The data is periodically refreshed by Lambdas calling third parties when needed. The Lambdas are triggered by both SQS and DynamoDB streams and go through four or five steps before the data is updated.
I'm given the task to write an API that can forcibly update N items and return their status. The obvious way to do this without reinventing the wheel and honoring DRY is to trigger an event that spawns off a refresh for each item so that the lambdas can do their thing.
The trouble is that I'm not sure of the best pub/sub approach for being notified that the end state of each workflow has been reached. Do I read from an update/insert stream of DynamoDB to see if the records are updated? Do I create some sort of pub/sub model, with something like Redis or SNS, to listen for the end state of each Lambda being triggered?
Since I'm writing a REST API, timeouts are fine if there are failures along the line. But at the same time I want to make sure I can handle the following:
Be guaranteed that I can be notified that an update occurred for my targets after my call (in the case of multiple forced updates being called at once I only care about the first one to arrive).
Not be bogged down by listening for updates for record updates that are not contextually relevant to the API call in question.
Have an amortized time complexity of O(1).
In other words, in terms of the CAP theorem I care about C and A but not P (because a 502 isn't that big a deal). But getting the timing wrong or missing a subscription is a problem.
I know I can just listen to a DynamoDB event stream, but I'm concerned that when things get noisy there will be more irrelevant stuff slowing me down. And I'm not sure whether having every single record get its own topic is scalable (or how messy that would be).
You can use DynamoDB streams in combination with Lambda Event Filtering so the Lambda function only executes for the relevant change you are interested in. More information is available here:
https://aws.amazon.com/about-aws/whats-new/2021/11/aws-lambda-event-filtering-amazon-sqs-dynamodb-kinesis-sources/
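As a rough sketch of what such a filter could look like, the snippet below builds the filter criteria for a DynamoDB stream source. The attribute name "status" and the value "COMPLETE" are hypothetical stand-ins for whatever marks the end state of the workflow, and build_filter_criteria is a made-up helper name. The resulting dict is what would be passed as FilterCriteria when creating the Lambda event source mapping.

```python
import json
from typing import Dict


def build_filter_criteria(attribute: str, end_state: str) -> Dict:
    """Build filter criteria matching only MODIFY events whose new image
    has the given string attribute set to the given end state."""
    pattern = {
        "eventName": ["MODIFY"],
        "dynamodb": {
            "NewImage": {
                # "S" is the DynamoDB type tag for string attributes
                attribute: {"S": [end_state]},
            }
        },
    }
    # Each filter pattern is passed to Lambda as a JSON string
    return {"Filters": [{"Pattern": json.dumps(pattern)}]}


# Hypothetical attribute name and value, for illustration only
criteria = build_filter_criteria("status", "COMPLETE")
```

Records that do not match the pattern never invoke the function, which addresses the concern about noisy, irrelevant updates.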

GraphQL Asynchronous query results

I'm trying to implement a batch query interface with GraphQL. I can get a request to work synchronously without issue, but I'm not sure how to approach making the result asynchronous. Basically, I want to be able to kick off the query and return a pointer of sorts to where the results will eventually be when the query is done. I'd like to do this because the queries can sometimes take quite a while.
In REST, this is trivial. You return a 202 and return a Location header pointing to where the client can go to fetch the result. GraphQL as a specification does not seem to have this notion; it appears to always want requests to be handled synchronously.
Is there any convention for doing things like this in GraphQL? I very much like the query specification but I'd prefer to not leave the client HTTP connection open for up to a few minutes while a large query is executed on the backend. If anything happens to kill that connection the entire query would need to be retried, even if the results themselves are durable.
What you're trying to do is not easily solved in a spec-compliant way. Apollo introduced the idea of a @defer directive that does pretty much what you're looking for, but it's still an experimental feature. I believe Relay Modern is trying to do something similar.
The idea is effectively the same -- the client uses a directive to mark a field or fragment as deferrable. The server resolves the request but leaves the deferred field null. It then sends one or more patches to the client with the deferred data. The client is able to apply the initial request and the patches separately to its cache, triggering the appropriate UI changes each time as usual.
I was working on a similar issue recently. My use case was to submit a job to create a report and provide the result back to the user. Creating a report takes a couple of minutes, which makes it an asynchronous operation. I created a mutation which submitted the job to the backend processing system and returned a job ID. Then I periodically poll the jobs field using a query to find out about the state of the job and eventually the results. As the result is a file, I return a link to a different endpoint where it can be downloaded (similar to the approach GitHub uses).
Polling for actual results is working as expected but I guess this might be better solved by subscriptions.
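The submit-then-poll flow described above can be sketched as follows. Everything here is an in-memory stand-in: submit_report_job plays the role of the mutation, get_job plays the role of the jobs query field, and complete_job simulates the backend finishing the report. None of these names come from a real GraphQL library.

```python
import itertools
import time
from typing import Callable, Dict, Optional

_job_ids = itertools.count(1)
_jobs: Dict[str, Dict] = {}  # stand-in for the backend job store


def submit_report_job() -> str:
    """Role of the mutation: register the job and return its ID immediately."""
    job_id = f"job-{next(_job_ids)}"
    _jobs[job_id] = {"state": "PENDING", "download_url": None}
    return job_id


def complete_job(job_id: str, download_url: str) -> None:
    """Role of the backend worker: mark the report as ready for download."""
    _jobs[job_id] = {"state": "DONE", "download_url": download_url}


def get_job(job_id: str) -> Dict:
    """Role of the jobs query field: return a copy of the current job state."""
    return dict(_jobs[job_id])


def wait_for_job(job_id: str, interval: float = 1.0, max_attempts: int = 60,
                 sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Poll until the job is done; return the download link, or None on giving up."""
    for _ in range(max_attempts):
        job = get_job(job_id)
        if job["state"] == "DONE":
            return job["download_url"]
        sleep(interval)
    return None
```

The client never holds a connection open for the duration of the job; each poll is a short, independent request that can be retried freely.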

How do I close the loop on batched writes in AWS?

I have an endpoint in my api that supports writes. The resource in question is collaborative, so it is reasonable to expect that there will be parallel write requests arriving concurrently.
If the number of writes is small, then this is relatively straightforward to do with a simple lambda: read the current state, compute the new state, compare and swap, spin until the swap succeeds or until we give up. In either case, we compute the appropriate http response and return it to the caller.
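For concreteness, here is a minimal sketch of that read, compute, compare-and-swap loop. VersionedStore is a hypothetical in-memory stand-in for whatever store offers conditional writes, and apply_write is a made-up name.

```python
from typing import Any, Callable, Optional, Tuple


class VersionedStore:
    """In-memory stand-in for a store that supports conditional writes."""

    def __init__(self, value: Any) -> None:
        self._value = value
        self._version = 0

    def read(self) -> Tuple[Any, int]:
        return self._value, self._version

    def compare_and_swap(self, expected_version: int, new_value: Any) -> bool:
        """Write only if nobody has written since expected_version was read."""
        if self._version != expected_version:
            return False
        self._value = new_value
        self._version += 1
        return True


def apply_write(store: VersionedStore, update: Callable[[Any], Any],
                max_attempts: int = 5) -> Optional[Any]:
    """Retry read/compute/swap; return the new state, or None if we gave up."""
    for _ in range(max_attempts):
        current, version = store.read()
        new_value = update(current)
        if store.compare_and_swap(version, new_value):
            return new_value
    return None
```

Every failed swap throws away one read and one computation, which is the waste that grows with the number of concurrent writers.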
If the API is successful, then eventually the waste of conflicting writes becomes expensive enough to address.
It looks as though the natural response is to copy the requests into a queue, with a function that consumes batches; within each batch, we process the requests in sequence, storing the new write, and computing the appropriate response to the request.
What are the options for getting those computed responses copied into the http responses, and what are the trade-offs to be considered?
My sense is that in handling the http request, after (synchronously) enqueuing the message, I need to block/poll on something that will eventually be populated with the response to the request.
I'm not sure if this will count as an answer, but I do not agree that the natural response is to copy/queue/block; that feels like you're just trading optimistic concurrency control for a kind of pessimistic one (and you'd probably have an easier time just implementing a lock using e.g. Redis, not to mention there are other issues with Lambda itself that would make the approach you describe even more difficult).
Users probably do not want an API like this as it would have high latency.
In my opinion, an API that is well designed for collaborative modification of some shared state has higher-order constructs that make the API successful. Taking a conversation as an example, you would decompose the chat into individual messages, where each message is in reply to some other message; the concurrent modification to the conversation is append-only for the most part (you might allow a user to edit an individual message, but that's not a point of resource contention), and you might do things like count the number of messages within the conversation asynchronously, such that it is eventually consistent.
You can look at the domain of your API and see if there's a way to expose modification to it in such a way that reduces contention by making modifications target sub-entities (even if the API represents this as a single resource, the storage engine does not have to).
Another option is looking into a model like event sourcing, where the changes themselves are literally appended and you derive the state from some snapshot plus recent changes.
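A minimal sketch of deriving state from a snapshot plus recent changes, reusing the conversation example. The Conversation and MessageAppended types are invented for illustration and not taken from any particular event sourcing framework.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Conversation:
    """Derived state: the messages seen so far."""
    messages: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class MessageAppended:
    """A change that is appended to the log and never modified afterwards."""
    text: str


def apply_event(state: Conversation, event: MessageAppended) -> Conversation:
    """Pure function: previous state plus one event gives the next state."""
    return Conversation(messages=state.messages + [event.text])


def current_state(snapshot: Conversation,
                  recent_events: Iterable[MessageAppended]) -> Conversation:
    """Replay the events recorded after the snapshot on top of it."""
    state = snapshot
    for event in recent_events:
        state = apply_event(state, event)
    return state
```

Writers only ever append events, so they do not contend on a single mutable record; contention moves to the (cheaper) append itself.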

Consistent N1QL Query Couchbase GOCB SDK

I'm currently implementing EventSourcing for my Go Actor lib.
The problem that I have right now is that when an actor restarts and needs to replay all its state from the event journal, the query might return inconsistent data.
I know that I can solve this using MutationToken.
But, if I do that, I would be forced to write all events in sequential order, that is, write the last event last.
That way the mutation token for the last event would be enough to get all the data consistently for the specific actor.
This is however very slow, writing about 10 000 events in order, takes about 5 sec on my setup.
If I instead write those 10 000 async, using goroutines, I can write all of the data in less than one sec.
But then the writes are in nondeterministic order and I can't know which mutation token I can trust.
e.g. Event 999 might be written before Event 843 due to goroutine scheduling AFAIK.
What are my options here?
Technically speaking, MutationToken and asynchronous operations are not mutually exclusive. It may be possible without a change to the client (I'm not sure), but the key is to collect all the MutationToken responses and then issue the query with the highest sequence number per vbucket across all of them.
The key here is that given a single MutationToken, you can add the others to it. I don't directly see a way to do this, but since internally it's just a map it should be relatively straightforward and I'm sure we (Couchbase) would take a contribution that does this. At the lowest level, it's just a map of vbucket sequences that is provided to query at the time the query is issued.
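The merging idea can be illustrated independently of the SDK. The sketch below is not the gocb API: Token is a hypothetical, simplified stand-in that keeps only a vbucket ID and a sequence number (a real token carries more, such as the vbucket UUID), and merge_tokens is a made-up name. It is written in Python purely to show the logic.

```python
from typing import Dict, Iterable, NamedTuple


class Token(NamedTuple):
    """Hypothetical simplified mutation token."""
    vbucket_id: int
    seqno: int


def merge_tokens(tokens: Iterable[Token]) -> Dict[int, int]:
    """Keep the highest sequence number seen for each vbucket.

    The order in which the writes completed does not matter, so the tokens
    can be collected from concurrent writers in any order.
    """
    highest: Dict[int, int] = {}
    for token in tokens:
        if token.seqno > highest.get(token.vbucket_id, -1):
            highest[token.vbucket_id] = token.seqno
    return highest
```

A query that waits for every vbucket to reach its merged sequence number would then see all of the events, whichever goroutine wrote last.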

How does Parse Query.each count towards execution limits

I am wondering how the each command on a Parse Query counts towards the request execution limits. I am building an app that will need to perform a function on many objects (could be more than 1000) in a parse class.
For example (in JavaScript),
var query = new Parse.Query(Parse.User);
query.equalTo('anObjectIWant', true); // there could be more than 1000 objects I want
query.each(function(object) {
    doSomething(object); // doSomething does NOT involve another Parse request
});
So, will the above code count as 1 request towards my Parse application execution limit (you get 30/second free), or will each object (each recurrence of calling "each") use one request (so 1000 objects would be 1000 requests)?
I have evaluated the resource usage by observing the number of API requests made by query.each() for different result set sizes. The bottom line is that (at the moment of writing) this function is using the default query result count limit of 100. Thus if your query matches up to 100 results it will make 1 API request, 2 API requests for 101-200 and so forth.
This behavior cannot be changed by manually increasing the limit to the maximum using query.limit(1000). If you do this you will get an error when you call query.each() afterwards (this is also mentioned in the documentation).
It is therefore worth considering implementing this functionality manually (e.g., with a recursive query.find()), which allows you to set the query limit to 1000 and thus, in the best case, consumes only one-tenth of the API requests query.each() would consume.
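The arithmetic behind that estimate can be written out as below. This only encodes the behavior reported in this answer (one request per page of results, at least one request overall); estimated_requests is a made-up helper, not part of the Parse SDK.

```python
import math


def estimated_requests(result_count: int, page_size: int) -> int:
    """Estimate API requests needed to page through result_count objects,
    following the observation above: one request per page, at least one."""
    return max(1, math.ceil(result_count / page_size))


# query.each() with its page size of 100, for 1000 matching objects
each_requests = estimated_requests(1000, 100)

# manual paging with query.find() and the limit raised to 1000
find_requests = estimated_requests(1000, 1000)
```

For 1000 matching objects this gives 10 requests with query.each() against 1 with manual paging, which is where the one-tenth figure comes from.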
This would count as 1 or 2 requests, depending on where it runs:
If it is run from a Cloud Code function, it counts as 2: 1 for the Cloud Code call plus 1 for the query. Since queries get their results all at once, the query is a single call.
If it is placed within a "beforeSave" function or similar, then only the query is counted: 1 API call.
So you should be pretty fine as long as you don't trigger another parse API for each result.
I would not be surprised if the .each method queried the server on each iteration.
You can actually check this using their "control panel": just look at the number of requests being made.
We left Parse after doing some prototyping. One of the reasons was that, while using proper and suggested code from the Parse website, I managed to create 6500 requests a day while being the only one using the app.
Using our own API, we are down to not more than 100.
