I have a document in ES. There is a field "A" which currently has the value {"Value1"}. Now two processes start in parallel, and both try to append some value to the field "A".
Let's say that if the two processes had not been concurrent, the field "A" would have ended up as {"Value1Value2Value3"} or {"Value1Value3Value2"}, i.e. one request tries to append "Value2" and the other tries to append "Value3". But for concurrent requests, how do I handle this case?
I would strongly suggest reading the official blog post on versioning support to understand how Elasticsearch handles concurrent updates to the same document.
Hint: it uses optimistic locking for speed, and you can use either internal or external versioning of your documents, with the warning below; the update API also supports automatic retries in case of a version conflict.
update and update_by_query do not work with internal versioning.
Please refer to the update_by_query documentation for further reading.
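As a rough sketch of the two options against the REST API (using Python requests; the index name, document id, field, and values are made up, and the typeless _doc/_update paths are the newer form; older versions include a mapping type in the path):

    import requests

    ES = "http://localhost:9200"

    # External versioning: the write is rejected with 409 Conflict if the
    # supplied version is not higher than the version already stored.
    requests.put(
        f"{ES}/my-index/_doc/1",
        params={"version": 10001, "version_type": "external"},
        json={"A": "Value1Value2"},
    )

    # The update API can retry automatically when a concurrent write bumps
    # the internal version between the read and the write.
    requests.post(
        f"{ES}/my-index/_update/1",
        params={"retry_on_conflict": 3},
        json={"script": {
            "source": "ctx._source.A += params.suffix",
            "params": {"suffix": "Value3"},
        }},
    )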
I'm trying to implement a batch query interface with GraphQL. I can get a request to work synchronously without issue, but I'm not sure how to approach making the result asynchronous. Basically, I want to be able to kick off the query and return a pointer of sorts to where the results will eventually be when the query is done. I'd like to do this because the queries can sometimes take quite a while.
In REST, this is trivial. You return a 202 with a Location header pointing to where the client can go to fetch the result. GraphQL as a specification does not seem to have this notion; it appears to always want requests to be handled synchronously.
Is there any convention for doing things like this in GraphQL? I very much like the query specification but I'd prefer to not leave the client HTTP connection open for up to a few minutes while a large query is executed on the backend. If anything happens to kill that connection the entire query would need to be retried, even if the results themselves are durable.
What you're trying to do is not solved easily in a spec-compliant way. Apollo introduced the idea of a @defer directive that does pretty much what you're looking for, but it's still an experimental feature. I believe Relay Modern is trying to do something similar.
The idea is effectively the same -- the client uses a directive to mark a field or fragment as deferrable. The server resolves the request but leaves the deferred field null. It then sends one or more patches to the client with the deferred data. The client is able to apply the initial request and the patches separately to its cache, triggering the appropriate UI changes each time as usual.
I was working on a similar issue recently. My use case was to submit a job to create a report and provide the result back to the user. Creating a report takes a couple of minutes, which makes it an asynchronous operation. I created a mutation that submitted the job to the backend processing system and returned a job ID. Then I periodically poll a jobs field using a query to find out the state of the job and eventually the results. As the result is a file, I return a link to a different endpoint where it can be downloaded (a similar approach to the one GitHub uses).
Polling for actual results is working as expected but I guess this might be better solved by subscriptions.
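For illustration, a minimal sketch of that submit-then-poll flow over plain HTTP with Python requests; the endpoint and the schema names (submitReportJob, job, status, downloadUrl) are hypothetical:

    import time
    import requests

    GRAPHQL_URL = "https://api.example.com/graphql"  # hypothetical endpoint

    # Kick off the long-running job; the mutation only returns an identifier.
    submit = requests.post(GRAPHQL_URL, json={
        "query": 'mutation { submitReportJob(reportType: "sales") { jobId } }'
    }).json()
    job_id = submit["data"]["submitReportJob"]["jobId"]

    # Poll a query field until the job reaches a terminal state.
    poll_query = """
    query Job($id: ID!) {
      job(id: $id) { status downloadUrl }
    }
    """
    while True:
        job = requests.post(GRAPHQL_URL, json={
            "query": poll_query,
            "variables": {"id": job_id},
        }).json()["data"]["job"]
        if job["status"] in ("COMPLETED", "FAILED"):
            break
        time.sleep(5)  # back off between polls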
The GraphQL spec mentions that you can send multiple operations in a request document as long as you select the operation to run:
http://facebook.github.io/graphql/October2016/#ExecuteRequest()
To execute a request, the executor must have a parsed Document (as defined in the “Query Language” part of this spec) and a selected operation name to run if the document defines multiple operations, otherwise the document is expected to only contain a single operation.
In what situation would you send such a document with multiple operations? I can only assume this is to allow switching between operations without having to create a separate document for each one?
I found this answer from the GraphQL team:
The use is threefold:
1. It's often useful to store GraphQL documents in a client side file; we use .graphql files on iOS and Android for this. We wanted the parser to work on these files as well (so the parser is universal), but you can (and we often do) include multiple queries in a single document.
2. One optimization that can be made (and that we'll discuss in more detail in the future, I'm sure) is "persisting" documents; you send the document to the server who stores it for you and gives you an identifier for that, that way you don't have to send the whole document string up every time. If you do that, you can send a document up with all your queries, and then pass the document ID + operation name over the wire. This would optimize bytes being sent from the client to the server.
3. It's possible to write a "batch" API for GraphQL, where you use the results from one query as the parameters to another. We're still working out the exact details there, but if you do that, it's useful to specify multiple queries in a single document; you'd specify which one you wanted to run at the start and the relationships between them.
Reference: https://github.com/facebook/graphql/issues/29#issuecomment-118994619
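For illustration, here is roughly what sending such a document looks like over HTTP with Python requests; the endpoint and the operation/field names are made up, but the query/operationName/variables payload shape is the common GraphQL-over-HTTP convention:

    import requests

    GRAPHQL_URL = "https://api.example.com/graphql"  # hypothetical endpoint

    # One document containing two named operations.
    document = """
    query UserSummary($id: ID!) { user(id: $id) { name } }
    query UserDetail($id: ID!) { user(id: $id) { name email posts { title } } }
    """

    # The server executes only the operation named in operationName.
    response = requests.post(GRAPHQL_URL, json={
        "query": document,
        "operationName": "UserDetail",
        "variables": {"id": "42"},
    })
    print(response.json())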
I need to process a list of files based on the result of a MongoDB query, but I can't find any processor that would let me do that. I basically have to take each file and either process it or completely discard it based on the result of a query that involves that file's attributes.
The only MongoDB-related processor that I see in NiFi 1.5.0 is GetMongo, which apparently can't receive connections, but only emit FlowFiles based on the configured parameters.
Am I looking in the wrong place?
NIFI-4827 is an Improvement Jira that aims to allow GetMongo to accept incoming flow files; the content would contain the query, and the properties will accept Expression Language. The code is still under review, but the intent is to make it available in the upcoming NiFi 1.6.0 release.
As a possible workaround in the meantime, if there is a REST API you could use InvokeHttp to make the call(s) manually and parse the result(s). Also if you have a JDBC driver for MongoDB (such as Unity), you might be able to use ExecuteSQL.
I'm trying to solve a task and would appreciate any help: links to documentation, links to forums, other FAQs besides https://cwiki.apache.org/confluence/display/NIFI/FAQs, or any meaningful answer in this post =).
So, I have the following task:
The initial part of my system collects data every 5-15 minutes from different DB sources. Then I remove duplicates, remove junk, combine data from the different sources according to my logic, and redirect it to the second part of the system as several streams.
As far as I know, NiFi can do this task best =).
Currently I can successfully get information from InfluxDB with the GetHTTP processor. However, I can't configure the same kind of processor for getting information from Elasticsearch with all the necessary options. I'd like to receive data every 5-15 minutes for the time period from "now minus 5-15 minutes" to "now" (depending on the scheduler period), with several additional filters. If I understand it right, this can be achieved either by a subscription to the index or by regular requests to the DB at the desired interval.
I know that NiFi has several Processors designed specifically for Elasticsearch (FetchElasticsearch5, FetchElasticsearchHttp, QueryElasticsearchHttp, ScrollElasticsearchHttp) as well as the GetHTTP and PostHTTP Processors. However, unfortunately, I lack information, and even more so examples, on how to configure their Properties for my purposes =(.
What's the difference between FetchElasticsearchHttp and QueryElasticsearchHttp? Which one fits my task better? What's the difference between GetHTTP and QueryElasticsearchHttp besides several specific fields? Will GetHTTP perform the same way if I tune it as I need?
Any advice?
I will be grateful for any help.
The ElasticsearchHttp processors try to make it easier to interact with ES by generating the appropriate REST API call based on the properties you set. If you know the full URL you need, you could use GetHttp or InvokeHttp. However, the ES Http processors let you put in just the pieces you're looking for, and they will generate the URL and return the results.
FetchElasticsearch (and its variants) is used to get a particular document when you know the identifier. This is sometimes used after a search/query, to return documents one at a time after you know which ones you want.
QueryElasticsearchHttp is for when you want to do a Lucene-style query of the documents, when you don't necessarily know which documents you want. It will only return up to the value of index.max_result_window for that index. To get more records, you can use ScrollElasticsearchHttp afterwards. NOTE: QueryElasticsearchHttp expects a query that will work as the "q" parameter of the URL. This "mini-language" (the Elasticsearch query string syntax) does not support all fields/operators.
For your use case, you likely need InvokeHttp in order to issue the kind of query you describe. This article describes how to issue a query for the last 15 minutes. Once your results are returned, you might need some combination of EvaluateJsonPath and/or SplitJson to work with the individual documents, see the Elasticsearch REST API documentation (and NiFi processor documentation) for more details.
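As a rough sketch of the kind of request InvokeHttp would be making (here issued directly with Python requests; the index name and the @timestamp field are assumptions), a "last 15 minutes" search uses a range filter with Elasticsearch date math:

    import requests

    # Query a hypothetical "metrics" index for documents whose @timestamp
    # falls within the last 15 minutes; "now-15m" is evaluated by Elasticsearch.
    response = requests.post(
        "http://localhost:9200/metrics/_search",
        json={
            "query": {
                "range": {
                    "@timestamp": {"gte": "now-15m", "lte": "now"}
                }
            },
            "size": 1000,
        },
    )
    hits = response.json()["hits"]["hits"]

In NiFi, that JSON body would be the request payload sent by InvokeHttp (e.g. as the flow file content) to the index's _search endpoint, with the processor scheduled every 5-15 minutes.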
I have a RabbitMQ broker to which I post different messages that will end up as documents in Elasticsearch. There are multiple consumers of the broker, which are actually different threads in a task executor assigned to an AMQP inbound gateway (using Spring Integration and Spring AMQP here).
Consider the following scenario: I have created a doc in ES with the structure
{
  "field1" : "value1",
  "field2" : "value2"
}
Afterwards I send two update requests, both updating the same field, let's say field1. If I send these messages one right after another (a common use case in production), my consumer threads will fetch the messages in the right order (AMQP allows this), but the processing could happen in the wrong order and the later updated value could be overwritten by the first one. I will end up with wrong data.
How can I make sure my data won't get corrupted? Having a single consumer thread is not enough, because if I want to scale out by adding more machines running my consuming app, I will still end up with multiple consumers. I might need message ordering, but with multiple machines I would probably need to create some sort of cluster-aware component; I am using Spring Integration, so this seems really hard to do in my opinion.
In pre-1.2 versions of ES, we used an external version, like a timestamp, and ES would have thrown a VersionConflictException in my scenario: the first update would have had version 10000, let's say, and the second 10001; if the second had been processed first, ES would reject the request with version 10000 as it's lower than the existing one. But in the latest versions, the ES team has removed this functionality for update operations.
One solution might be to use multiple queues and have a single consumer on each queue; use a hash function to always route updates to the same document to the same queue. See the RabbitMQ Tutorials for the various options.
You can scale out by adding more queues (and changing your hash function).
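A minimal sketch of that routing idea with Python and pika; the queue names and message layout are made up. The point is that a stable hash of the document id always picks the same queue, so the single consumer on that queue processes all updates to a given document in publish order:

    import json
    import zlib
    import pika

    NUM_QUEUES = 4  # scaling out means adding queues and changing this
    QUEUES = [f"es-updates-{i}" for i in range(NUM_QUEUES)]

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    for q in QUEUES:
        channel.queue_declare(queue=q, durable=True)

    def publish_update(doc_id, update):
        # crc32 is stable across processes, so every producer picks the same
        # queue for a given document id.
        queue = QUEUES[zlib.crc32(doc_id.encode()) % NUM_QUEUES]
        channel.basic_publish(exchange="", routing_key=queue,
                              body=json.dumps({"doc_id": doc_id, "update": update}))

    publish_update("doc-1", {"field1": "new value"})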
For resiliency, consider running your consumers in Spring XD. You can have a single instance of each rabbit source (for each queue) and XD will take care of failing it over to another container node if it goes down.
Otherwise you could roll your own by having a warm standby: inbound adapters configured with auto-startup="false", plus something that monitors the active instance and uses a <control-bus/> to start a new one if the active one goes down.
EDIT:
In response to the fourth comment below.
As I said above, to scale out, you would have to change the hash function. So adding consumers automatically while running would be tricky.
You don't have to hard-code the queue names in the jar, you can use a property placeholder and fill it from properties, system properties, or an environment variable.
This solution is the simplest but does have these limitations.
You could, however, build a management app that could scale it out - stop the producer, wait for all queues to quiesce, reconfigure the consumers and restart the producer - Spring Integration provides a <control-bus/> to start/stop adapters; you can also do it via JMX.
Alternative solutions are possible but will generally require maintaining some shared state across a cluster (perhaps using zookeeper etc), so are much more complex; and you still have to deal with race conditions (where the second update might arrive at some consumer before the first).
You can use the built-in versioning mechanism for consistency checks. Basically you want to verify that you have the latest version of whatever you are updating.
For that you need to fetch the _version with the object. In queries you can do this by setting version=true at the top level; that will cause the _version to be returned along with your query results. Then, when doing an update, you simply set the version parameter in the URL to the value you have, and it will generate a version conflict if it doesn't match.
Nicer is to handle updates using closures. Basically this works as follows: have an update method that fetches the object by id, applies a closure (a parameter to the update function) that encapsulates the modifications you want to make, and then stores the modified object. If you trap the still-possible version conflict, you can simply get the object again and re-apply the closure. We do this and added a random sleep before the retry as well; this vastly reduces the chance of multiple updates failing and is a nice design pattern. Keeping the read and write together minimizes the chance of a conflict, and retrying with a sleep before it minimizes it further. You could add multiple retries to further reduce the risk.
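A rough sketch of that read-modify-write-with-retry pattern in Python against the REST API; the index, document id, and field are hypothetical, and the version URL parameter is the older style described above (newer clusters would pass if_seq_no/if_primary_term instead):

    import random
    import time
    import requests

    ES = "http://localhost:9200"
    DOC_URL = f"{ES}/my-index/_doc/1"  # hypothetical index and id

    def update_with_retry(modify, max_retries=5):
        """Fetch the doc, apply the `modify` closure, and write it back with
        the version we read; retry with a random sleep on a 409 conflict."""
        for attempt in range(max_retries):
            current = requests.get(DOC_URL).json()
            new_source = modify(current["_source"])  # apply the caller's changes
            resp = requests.put(DOC_URL,
                                params={"version": current["_version"]},
                                json=new_source)
            if resp.status_code != 409:
                return resp.json()
            time.sleep(random.uniform(0.05, 0.5))  # random back-off before retrying
        raise RuntimeError("update failed after repeated version conflicts")

    # Usage: append to field1 safely under concurrent writers.
    update_with_retry(lambda doc: {**doc, "field1": doc["field1"] + "ValueX"})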