Queueing mechanism and Elasticsearch 1.4.0 - elasticsearch

I have a RabbitMQ broker to which I post different messages that end up as documents in Elasticsearch. There are multiple consumers on the broker, which are actually different threads in a task executor assigned to an AMQP inbound gateway (using Spring Integration and Spring AMQP here).
Consider the following scenario: I have created a doc in ES with the structure
{
"field1" : "value1",
"field2" : "value2"
}
Afterwards I send two update requests, both updating the same field, let's say field1. If I send these messages one right after another (a common use case in production), my consumer threads will fetch the messages in the right order (AMQP allows this), but the processing could happen in the wrong order and the later value could be overwritten by the earlier one. I will end up with wrong data.
How can I make sure my data won't get corrupted? Having a single consumer thread is not enough, because if I want to scale out by adding more machines running my consuming app, I will still end up with multiple consumers. I might need ordering of messages, but with multiple machines I would probably need to create some sort of cluster-aware component. I am using Spring Integration, so this seems really hard to do in my opinion.
In pre-1.2 versions of ES, we used an external version, like a timestamp, and ES would have thrown VersionConflictException in my scenario: the first update would have had version 10000, let's say, and the second 10001; if the second had been processed first, ES would reject the request with version 10000 as it's lower than the existing one. But in the latest versions, the ES developers have removed this functionality for update operations.
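The external-version behaviour described above can be modelled in a few lines. This is only an in-memory sketch, not Elasticsearch client code: VersionedStore, VersionConflict and index are hypothetical names, and the rule (accept a write only when its version is strictly greater than the stored one) is simply the behaviour described in the question.

```python
class VersionConflict(Exception):
    """Raised when a write carries a version that is not newer than the stored one."""


class VersionedStore:
    """Toy in-memory document store with external versioning (hypothetical)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, document)

    def index(self, doc_id, document, version):
        current = self._docs.get(doc_id)
        if current is not None and version <= current[0]:
            raise VersionConflict(
                f"{doc_id}: version {version} is not newer than {current[0]}"
            )
        self._docs[doc_id] = (version, document)

    def get(self, doc_id):
        return self._docs[doc_id][1]


store = VersionedStore()
store.index("doc-1", {"field1": "value1", "field2": "value2"}, version=9999)

# The two updates are processed out of order: the later one (10001) arrives first.
store.index("doc-1", {"field1": "second", "field2": "value2"}, version=10001)
try:
    store.index("doc-1", {"field1": "first", "field2": "value2"}, version=10000)
    stale_rejected = False
except VersionConflict:
    stale_rejected = True
```

With a check like this, the stale write fails loudly instead of silently overwriting the newer value, whichever consumer thread happens to run first.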

One solution might be to use multiple queues and have a single consumer on each queue; use a hash function to always route updates to the same document to the same queue. See the RabbitMQ Tutorials for the various options.
You can scale out by adding more queues (and changing your hash function).
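A minimal sketch of the routing idea, assuming a fixed number of queues with one consumer each. The queue naming scheme (es.updates.N) and the function name queue_for are made up for illustration; the only important property is that the hash is stable across producers.

```python
import zlib

QUEUE_COUNT = 4  # assumed number of queues, each with exactly one consumer


def queue_for(doc_id: str, queue_count: int = QUEUE_COUNT) -> str:
    """Map a document id to a queue name so all updates to one document share a queue."""
    # crc32 is stable across processes and machines, unlike Python's built-in hash().
    bucket = zlib.crc32(doc_id.encode("utf-8")) % queue_count
    return f"es.updates.{bucket}"


routes = {doc_id: queue_for(doc_id) for doc_id in ("doc-1", "doc-2", "doc-3")}
```

Note that changing queue_count remaps most document ids to different queues, which is why scaling out this way means changing the hash function while no messages are in flight.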
For resiliency, consider running your consumers in Spring XD. You can have a single instance of each rabbit source (for each queue) and XD will take care of failing it over to another container node if it goes down.
Otherwise you could roll your own by having a warm standby - inbound adapters configured with auto-startup="false" and have something monitor and use a <control-bus/> to start a new instance if the active one goes down.
EDIT:
In response to the fourth comment below.
As I said above, to scale out, you would have to change the hash function. So adding consumers automatically while running would be tricky.
You don't have to hard-code the queue names in the jar, you can use a property placeholder and fill it from properties, system properties, or an environment variable.
This solution is the simplest but does have these limitations.
You could, however, build a management app that could scale it out - stop the producer, wait for all queues to quiesce, reconfigure the consumers and restart the producer - Spring Integration provides a <control-bus/> to start/stop adapters; you can also do it via JMX.
Alternative solutions are possible but will generally require maintaining some shared state across a cluster (perhaps using zookeeper etc), so are much more complex; and you still have to deal with race conditions (where the second update might arrive at some consumer before the first).

You can use the default mechanism for consistency checks. Basically you want to verify that you have the latest version of whatever you are updating.
So for that you need to fetch the _version with the object. In queries you can do this by setting version=true at the top level. That will cause the _version to be returned along with your query results. Then when doing an update, you simply set the version parameter in the URL to the value you have and it will generate a version conflict if it doesn't match.
Nicer is to handle updates using closures. Basically this works as follows: have an update method that fetches the object by id, applies a closure (a parameter to the update function) that encapsulates the modifications you want to make, and then stores the modified object. If you trap the still possible version conflict, you can simply get the object again and re-apply the closure to it. We do this and added a random sleep before the retry as well; this vastly reduces the chance of multiple updates failing and is a nice design pattern. Keeping the read and write together minimizes the chance of a conflict, and retrying with a sleep before that minimizes it further. You could add multiple retries to further reduce the risk.
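The closure-based update with retry can be sketched like this. Store is a toy in-memory stand-in with optimistic locking, not an Elasticsearch client, and update, modify, retries and max_sleep are names chosen for the example.

```python
import random
import time


class VersionConflict(Exception):
    pass


class Store:
    """Toy in-memory store with optimistic locking (hypothetical stand-in for ES)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, document)

    def get(self, doc_id):
        version, doc = self._docs[doc_id]
        return version, dict(doc)

    def put(self, doc_id, doc, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(doc_id)
        self._docs[doc_id] = (current_version + 1, dict(doc))


def update(store, doc_id, modify, retries=3, max_sleep=0.05):
    """Fetch, apply `modify`, store; on a version conflict, sleep briefly and retry."""
    for attempt in range(retries + 1):
        version, doc = store.get(doc_id)
        modify(doc)
        try:
            store.put(doc_id, doc, expected_version=version)
            return doc
        except VersionConflict:
            if attempt == retries:
                raise
            time.sleep(random.uniform(0, max_sleep))


store = Store()
store.put("doc-1", {"field1": "value1", "counter": 0})
calls = []


def bump(doc):
    # Simulate a competing writer sneaking in during the first attempt only.
    if not calls:
        _, other = store.get("doc-1")
        other["field1"] = "from-other-writer"
        store.put("doc-1", other)
    calls.append(1)
    doc["counter"] += 1


result = update(store, "doc-1", bump)
```

The first attempt hits a conflict, the retry re-reads the document and re-applies the closure, so neither writer's change is lost.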

Related

MassTransit MessageData Management

I have started to make greater use of the message data feature of MassTransit and am getting to the point of needing to manage the message data in the store - i.e. remove old data.
The obvious choice is to have some outside process tidy up data, but clearly a scheduled (or not) clean up could remove data still in use or referenced by error or dead letter queues.
Ideally I would like to limit stored message data retention to messages only in error or dead letter queues, and automatically remove data for messages that have been successfully processed.
What would be the best approach to achieve this with MassTransit? Perhaps with a MiddleWare approach or similar, and if that is the case what is the correct approach?
Manual cleanup is recommended, using whatever makes sense for the repository in use. Because messages may still be in queues, or in error/dead-letter queues as you pointed out, it is really up to the development/operations team to know when the right time is to remove older message data.
I'd suggest monitoring and managing the error/dead-letter queues more aggressively, keeping them empty. And then, just figure a good timeframe to delete old message data - one week, ten days, whatever - and deal with it that way.
I have had a backlog item to come up with a way to automatically manage message data, but since message data can be forwarded (using the same stored data) either via publish or send, there is no good way to track references.
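As an illustration of the "delete after a fixed timeframe" approach, here is a rough sketch that assumes the message data lives as plain files in a directory. That storage layout, the function name and the seven-day cutoff are assumptions for the example, not anything MassTransit prescribes.

```python
import os
import tempfile
import time
from pathlib import Path


def remove_old_message_data(root, max_age_days, now=None):
    """Delete files under `root` whose modification time is older than `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 24 * 60 * 60
    removed = []
    # Materialize the listing first so deletion does not disturb the directory walk.
    for path in list(Path(root).rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old_file = root / "old-payload.bin"
    new_file = root / "new-payload.bin"
    old_file.write_bytes(b"old")
    new_file.write_bytes(b"new")
    ten_days_ago = time.time() - 10 * 24 * 60 * 60
    os.utime(old_file, (ten_days_ago, ten_days_ago))  # pretend it was stored 10 days ago

    removed = remove_old_message_data(root, max_age_days=7)
    removed_names = [p.name for p in removed]
    remaining = sorted(p.name for p in root.iterdir())
```

This only works safely if the error/dead-letter queues are emptied well within the chosen timeframe, since the cleanup has no idea which payloads are still referenced.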

Spring Boot Kafka: Consume same message with all instances for specific topic

I have a spring boot application (let's say it's called app-1) that is connected to a kafka cluster and that consumes from a specific topic, let's say the topic is called "foo". Topic foo always receives a message when another application (let's say it's called app-2) has imported a new foo-item into the database.
The topic is primarily meant to be used in a third application (let's say it's called app-3) which sends out some e-Mail notification to people that may be interested in this new foo-item. App-3 is clustered, meaning there are multiple instances of it running at the same time. Kafka automatically balances the foo-topic messages between all these instances because they use the same consumer-id. This is good and in the case of app-3 it is actually desired.
In the case of app-2, however, the messages from the foo-topic are used for cache eviction. The logic is, basically, that if there is a new foo-item then the currently existing caches should probably be cleared, because their content depends on the foo-items. The issue is that app-2 is also clustered, which means that by default kafka-logic, every instance will only receive some of the messages sent to the foo-topic. This does not work correctly for this specific app though, because whenever there is a new foo-item, all of the instances need to know about it because all of them need to clear their local caches.
From what I understand I have these two options if I want to keep the current logic:
Introduce a distributed cache for all instances of app-2 so that they all share the same cache. Then it does not matter if only one instance receives a foo-item, because the cache eviction will also affect the cache of the other instances; even though they never learned about the foo-item. I would like to avoid this solution, as a distributed cache would add a noticeable amount of complexity and also overhead.
Somehow manage to use a different consumer-id for each instance of app-2. Then they would be considered different consumers by kafka and they all would get each foo-topic message. However, I don't even know how to programmatically do this. The code of the application is not aware of replicated instances, there is no way to access any information about what node it is. If I use a randomly generated string on startup, then each time such instance restarts it would be considered a new consumer and would have to re-process all previous messages. That would be incorrect behavior as well.
Here is my bottom line question: Is it possible to make all instances of app-2 receive all messages from the foo-topic without completely breaking the way kafka is supposed to work? I know that it is probably very unconventional to use kafka-messages for cache eviction and I am entirely able to find an alternative mechanism for the cache eviction logic that does not depend on kafka-topic messages. However, the applications are for demonstration purposes and I thought it would be cool if more than one app read from this topic. But if I end up having to hack a dirty workaround to make it work then it's also bad for demonstration purposes and I would rather implement an alternative way of cache eviction.
As you mentioned, you could use different consumer ids with random strings.
If notifications are being read from the beginning, then you probably have ConsumerConfig.AUTO_OFFSET_RESET_CONFIG set to "earliest" somewhere in your consumer configuration. If this is the case, removing it will probably solve your problem - when the app starts, it will only receive notifications sent after the consumer started listening.
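A small sketch of what such a configuration could look like, assuming the "consumer-id" in the question corresponds to Kafka's consumer group id (group.id). The property keys are standard Kafka consumer settings; the function name and the app-2 prefix are made up for the example, and the resulting dict would still have to be handed to whatever client library is in use.

```python
import uuid


def consumer_config(bootstrap_servers: str, app_name: str = "app-2") -> dict:
    """Build consumer properties with a group id that is unique to this process."""
    return {
        "bootstrap.servers": bootstrap_servers,
        # A distinct group.id per instance means every instance receives every message.
        "group.id": f"{app_name}-{uuid.uuid4()}",
        # A brand-new group has no committed offsets; "latest" starts at the end of
        # the topic instead of replaying old messages.
        "auto.offset.reset": "latest",
    }


instance_a = consumer_config("localhost:9092")
instance_b = consumer_config("localhost:9092")
```

Because the group id changes on every restart, an instance will miss messages sent while it was down, which is usually acceptable for cache eviction since a freshly started instance has an empty cache anyway.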

Using my own Cassandra driver to write aggregations results

I'm trying to create a simple application which writes the page views of each web page on my site to Cassandra. Every 5 minutes, I want to write the cumulative page views since the start of the logical hour.
My code for this looks something like this:
KTable<Windowed<String>, Long> hourlyPageViewsCounts = keyedPageViews
    .groupByKey()
    .count(TimeWindows.of(TimeUnit.MINUTES.toMillis(60)), "HourlyPageViewsAgg");
Where I also set my commit interval to 5 minutes by setting the COMMIT_INTERVAL_MS_CONFIG property. To my understanding that should aggregate on full hour and output intermediate accumulation state every 5 minutes.
My questions now are two:
Given that I have my own Cassandra driver, how do I write the 5 min intermediate results of the aggregation to Cassandra? Tried to use foreach but that doesn't seem to work.
I need a write only after 5 min of aggregation, not on each update. Is it possible? Reading here suggests it might not without using low-level API, which I'm trying to avoid as it seems like a simple enough task to be accomplished with the higher level APIs.
Committing and producing/writing output are two different concepts in the Kafka Streams API. In the Kafka Streams API, output is produced continuously and commits are used to "mark progress" (ie, to commit consumer offsets, which includes flushing all stores and buffered producer records).
You might want to check out this blog post for more details: https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
1) To write to Cassandra, it is recommended to write the result of your application back into a topic (via #to("topic-name")) and use Kafka Connect to get the data into Cassandra.
Compare: External system queries during Kafka Stream processing
2) Using the low-level API is the only way to go (as you pointed out already) if you want strict 5-minute intervals. Note that the next release (Kafka 1.0) will include wall-clock-time punctuations, which should make it easier for you to achieve your goal.
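To make the "write only every 5 minutes" idea concrete, here is a plain-Python model of what a punctuation-style processor would do: remember only the latest aggregate per key and hand the whole buffer to a sink at fixed intervals. This is not the Kafka Streams Processor API; PeriodicEmitter, on_update, maybe_flush and the sink callable are names invented for the sketch, and time is passed in explicitly so the behaviour is easy to follow.

```python
class PeriodicEmitter:
    """Buffers the latest value per key and emits the buffer at fixed intervals."""

    def __init__(self, interval_seconds, sink):
        self.interval = interval_seconds
        self.sink = sink        # callable(key, value), e.g. a Cassandra write
        self.pending = {}       # key -> latest aggregate since the last flush
        self.next_flush = None

    def on_update(self, key, value, now):
        if self.next_flush is None:
            self.next_flush = now + self.interval
        self.pending[key] = value  # later updates overwrite earlier ones
        self.maybe_flush(now)

    def maybe_flush(self, now):
        if self.next_flush is not None and now >= self.next_flush:
            for key, value in self.pending.items():
                self.sink(key, value)
            self.pending.clear()
            self.next_flush = now + self.interval


written = []
emitter = PeriodicEmitter(300, lambda key, value: written.append((key, value)))
emitter.on_update("/home", 1, now=0)
emitter.on_update("/home", 2, now=100)
emitter.on_update("/about", 1, now=200)
written_before_interval = list(written)   # nothing has been written yet
emitter.on_update("/home", 3, now=300)    # interval elapsed: buffer is flushed
```

In this model a flush only happens when an update arrives or maybe_flush is called; with real wall-clock punctuation the runtime would call the flush on a timer.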

Check if S3 file has been modified

How can I use a shell script to check if an Amazon S3 file (a small .xml file) has been modified? I'm currently using curl to check every 10 seconds, but it's making many GET requests.
while true
do
  # download the current copy (-s: quiet, -o: write to file.xml)
  curl -s -o "file.xml" "s3.aws.amazon.com/bucket/file.xml"
  # cmp -s returns 0 when the two files are identical
  if cmp -s "file.xml" "current.xml"
  then
    echo "no change"
  else
    echo "file changed"
    cp "file.xml" "current.xml"
  fi
  sleep 10
done
Is there a better way to check every 10 seconds that reduces the number of GET requests? (This is built on top of a Rails app, so I could possibly build a handler in Rails?)
Let me start by first telling you some facts about S3. You might know this, but in case you don't, you might see that your current code could have some "unexpected" behavior.
S3 and "Eventual Consistency"
S3 provides "eventual consistency" for overwritten objects. From the S3 FAQ, you have:
Q: What data consistency model does Amazon S3 employ?
Amazon S3 buckets in all Regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES.
Eventual consistency for overwrites means that, whenever an object is updated (ie, whenever your small XML file is overwritten), clients retrieving the file MAY see the new version, or they MAY see the old version. For how long? For an unspecified amount of time. It typically achieves consistency in much less than 10 seconds, but you have to assume that it will, eventually, take more than 10 seconds to achieve consistency. More interestingly (sadly?), even after a successful retrieval of the new version, clients MAY still receive the older version later.
One thing that you can be assured of is: if a client starts downloading a version of the file, it will download that entire version (in other words, there's no chance that you would receive, for example, the first half of the XML file as the old version and the second half as the new version).
With that in mind, notice that your script could fail to identify the change within your 10-second timeframe: you could make multiple requests, even after a change, until your script downloads a changed version. And even then, after you detect the change, it is (unfortunately) entirely possible that the next request would download the previous (!) version, and trigger yet another "change" in your code, then the next would give the current version, and trigger yet another "change" in your code!
If you are OK with the fact that S3 provides eventual consistency, there's a way you could possibly improve your system.
Idea 1: S3 event notifications + SNS
You mentioned that you thought about using SNS. That could definitely be an interesting approach: you could enable S3 event notifications and then get a notification through SNS whenever the file is updated.
How do you get the notification? You would need to create a subscription, and here you have a few options.
Idea 1.1: S3 event notifications + SNS + a "web app"
If you have a "web application", ie, anything running in a publicly accessible HTTP endpoint, you could create an HTTP subscriber, so SNS will call your server with the notification whenever it happens. This might or might not be possible or desirable in your scenario.
Idea 2: S3 event notifications + SQS
You could create a message queue in SQS and have S3 deliver the notifications directly to the queue. This would also be possible as S3 event notifications + SNS + SQS, since you can add a queue as a subscriber to an SNS topic (the advantage being that, in case you need to add functionality later, you could add more queues and subscribe them to the same topic, therefore getting "multiple copies" of the notification).
To retrieve the notification you'd make a call to SQS. You'd still have to poll - ie, have a loop and call GET on SQS (which costs about the same as S3 GETs, or maybe a tiny bit more depending on the region). The slight difference is that you could reduce the total number of requests a bit -- SQS supports long-polling requests of up to 20 seconds: you make the GET call on SQS and, if there are no messages, SQS holds the request for up to 20 seconds, returning immediately if a message arrives, or returning an empty response if no messages are available within those 20 seconds. So, you would send only 1 GET every 20 seconds, and get faster notifications than you currently have. You could potentially halve the number of GETs you make (once every 10s to S3 vs once every 20s to SQS).
Also - you could choose to use one single SQS queue to aggregate all changes to all XML files, or multiple SQS queues, one per XML file. With a single queue, you would greatly reduce the overall number of GET requests. With one queue per XML file, that's when you could potentially "halve" the number of GET requests as compared to what you have now.
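The request counts behind "halve" and "greatly reduce" are easy to work out. The numbers below assume the idle case (no notifications arriving, so every long poll runs its full 20 seconds) and a hypothetical set of 5 XML files; both are assumptions for the arithmetic only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def requests_per_day(interval_seconds: int, pollers: int = 1) -> int:
    """Number of requests made per day by `pollers` loops, one request per interval."""
    return pollers * SECONDS_PER_DAY // interval_seconds


xml_files = 5  # hypothetical number of files being watched

s3_polling = requests_per_day(10, pollers=xml_files)          # one GET per file every 10 s
sqs_queue_per_file = requests_per_day(20, pollers=xml_files)  # one long poll per queue every 20 s
sqs_single_queue = requests_per_day(20)                       # one long poll for all files
```

So with one queue per file the request count is half of the current polling, and with a single shared queue it no longer grows with the number of files at all.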
Idea 3: S3 event notifications + AWS Lambda
You can also use a Lambda function for this. This could require some more changes in your environment - you wouldn't use a Shell Script to poll, but S3 can be configured to call a Lambda Function for you as a response to an event, such as an update on your XML file. You could write your code in Java, Javascript or Python (some people devised some "hacks" to use other languages as well, including Bash).
The beauty of this is that there's no more polling, and you don't have to maintain a web server (as in "idea 1.1"). Your code "simply runs", whenever there's a change.
Notice that, no matter which one of these ideas you use, you still have to deal with eventual consistency. In other words, you'd know that a PUT/POST has happened, but once your code sends a GET, you could still receive the older version...
Idea 4: Use DynamoDB instead
If you have the ability to make a more structural change on the system, you could consider using DynamoDB for this task.
The reason I suggest this is because DynamoDB supports strong consistency, even for updates. Notice that it's not the default - by default, DynamoDB operates in eventual consistency mode, but the "retrieval" operations (GetItem, for example), support fully consistent reads.
Also, DynamoDB has what we call "DynamoDB Streams", which is a mechanism that allows you to get a stream of changes made to any (or all) items on your table. These notifications can be polled, or they can even be used in conjunction with a Lambda function, that would be called automatically whenever a change happens! This, plus the fact that DynamoDB can be used with strong consistency, could possibly help you solve your problem.
In DynamoDB, it's usually a good practice to keep the records small. You mentioned in your comments that your XML files are about 2kB - I'd say that could be considered "small enough" so that it would be a good fit for DynamoDB! (the reasoning: DynamoDB reads are typically calculated as multiples of 4kB; so to fully read 1 of your XML files, you'd consume just 1 read; also, depending on how you do it, for example using a Query operation instead of a GetItem operation, you could possibly be able to read 2 XML files from DynamoDB consuming just 1 read operation).
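The read-cost reasoning can be written out as a small calculation. It is a sketch under stated assumptions: item sizes of about 2,000 bytes (in DynamoDB the item size also includes attribute names, so the real figure would be a bit above the raw XML size), and the function names are made up for the example.

```python
import math

UNIT_BYTES = 4 * 1024  # one read unit covers up to 4 KB


def get_item_units(item_sizes, strongly_consistent=True):
    """Each GetItem rounds its own item up to the next 4 KB."""
    units = sum(math.ceil(size / UNIT_BYTES) for size in item_sizes)
    return units if strongly_consistent else units / 2


def query_units(item_sizes, strongly_consistent=True):
    """A Query sums the sizes of the items it reads, then rounds up once."""
    units = math.ceil(sum(item_sizes) / UNIT_BYTES)
    return units if strongly_consistent else units / 2


two_xml_files = [2000, 2000]  # assumed item sizes in bytes

separate_reads = get_item_units(two_xml_files)                        # 2 read units
one_query = query_units(two_xml_files)                                # 1 read unit
one_query_eventual = query_units(two_xml_files, strongly_consistent=False)
```

Eventually consistent reads cost half as much, so the strong consistency that motivates the switch to DynamoDB is paid for in read units.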
Some references:
http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
http://docs.aws.amazon.com/lambda/latest/dg/with-ddb.html
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_ReceiveMessage.html
I can think of another way by using S3 Versioning; this would require the least amount of changes to your code.
Versioning is a means of keeping multiple variants of an object in the same bucket.
This would mean that every time a new file.xml is uploaded, S3 will create a new version.
In your script, instead of getting the object and comparing it, get the HEAD of the object which contains the VersionId field. Match this version with the previous version to find out if the file has changed.
If the file has indeed changed, get the new file, and also get the new version of that file and save it locally so that next time you can use this version to check if a newer-newer version has been uploaded.
Note 1: You will still be making lots of calls to S3, but instead of fetching the entire file every time, you are only fetching the metadata of the file which is much faster and smaller in size.
Note 2: However, if your aim was to reduce the number of calls, the easiest solution I can think of is using lambdas. You can trigger a lambda function every time a file is uploaded that then calls the REST endpoint of your service to notify you of the file change.
You can use --exact-timestamps with aws s3 sync
see AWS discussion
https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
Instead of using versioning, you can simply compare the ETag of the file, which is available in the response headers and is similar to the MD5 hash of the file (it is exactly the MD5 hash if the file is small, i.e. less than 4 MB, or sometimes even larger; otherwise, it is the MD5 hash of a list of binary hashes of blocks).
With that said, I would suggest you look at your application again and ask if there is a way you can avoid this critical path.
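The ETag comparison could look roughly like this. The sketch assumes the small-file case, where the ETag is the plain MD5 of the content, and it takes the ETag header value as a string (obtained from a HEAD request, for example with curl -I) rather than performing the request itself. The function names are made up for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def local_md5(path: Path) -> str:
    """Hex MD5 digest of a local file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(64 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(local_copy: Path, etag_header: str) -> bool:
    """Compare a local file with the ETag value taken from a HEAD response."""
    # S3 returns the ETag wrapped in double quotes.
    return local_md5(local_copy) != etag_header.strip('"').lower()


with tempfile.TemporaryDirectory() as tmp:
    current = Path(tmp) / "current.xml"
    current.write_bytes(b"<items/>")

    same_etag = '"' + hashlib.md5(b"<items/>").hexdigest() + '"'
    new_etag = '"' + hashlib.md5(b"<items><item/></items>").hexdigest() + '"'

    unchanged = has_changed(current, same_etag)
    changed = has_changed(current, new_etag)
```

Only when has_changed returns True does the script need to issue a full GET, so most polls transfer headers only.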

An event store could become a single point of failure?

For a couple of days I've been trying to figure out how to inform the rest of the microservices that a new entity was created in a microservice A that stores that entity in MongoDB.
I want to:
Have low coupling between the microservices
Avoid distributed transactions between microservices like Two Phase Commit (2PC)
At first a message broker like RabbitMQ seems to be a good tool for the job, but then I see the problem that committing the new document in MongoDB and publishing the message to the broker are not atomic.
Why event sourcing? by eventuate.io:
One way of solving this issue is to make the schema of the documents a bit dirtier by adding a flag that says whether the document has been published to the broker, and to have a scheduled background process that searches for unpublished documents in MongoDB and publishes them to the broker using confirmations; when the confirmation arrives, the document is marked as published (using at-least-once and idempotency semantics). This solution is proposed in this and this answers.
Reading an Introduction to Microservices by Chris Richardson I ended up in this great presentation of Developing functional domain models with event sourcing where one of the slides asked:
How to atomically update the database and publish events without 2PC? (dual write problem).
The answer is simple (on the next slide)
Update the database and publish events
This is a different approach to this one that is based on CQRS a la Greg Young.
The domain repository is responsible for publishing the events, this
would normally be inside a single transaction together with storing
the events in the event store.
I think that delegating the responsibilities of storing and publishing the events to the event store is a good thing because it avoids the need for 2PC or a background process.
However, in a certain way it's true that:
If you rely on the event store to publish the events you'd have a
tight coupling to the storage mechanism.
But we could say the same if we adopt a message broker for communication between the microservices.
The thing that worries me more is that the Event Store seems to become a Single Point of Failure.
If we look at this example from eventuate.io
we can see that if the event store is down, we can't create accounts or money transfers, losing one of the advantages of microservices (although the system will continue responding to queries).
So, is it correct to say that the Event Store, as used in the eventuate example, is a Single Point of Failure?
What you are facing is an instance of the Two Generals' Problem. Basically, you want two entities on a network to agree on something, but the network is not fail-safe. It has been proven that this cannot be guaranteed.
So no matter how much you add new entities to your network, the message queue being one, you will never have 100% certainty that agreement will be reached. In fact, the opposite takes place: the more entities you add to your distributed system, the less you can be certain that an agreement will eventually be reached.
A practical answer to your case is that 2PC is not that bad if the alternative is adding even more complexity and more single points of failure. If you absolutely do not want a single point of failure and want to assume that the network is reliable (in other words, that the network itself cannot be a single point of failure), you can try a P2P algorithm such as DHT, but for two peers I bet it reduces to simple 2PC.
We handle this with the Outbox approach in NServiceBus:
http://docs.particular.net/nservicebus/outbox/
This approach requires that the initial trigger for the whole operation came in as a message on the queue but works very well.
You could also create a flag for each entry inside of the event store which tells if this event was already published. Another process could poll the event store for those unpublished events and put them into a message queue or topic. The disadvantage of this approach is that consumers of this queue or topic must be designed to de-duplicate incoming messages because this pattern does only guarantee at-least-once delivery. Another disadvantage could be latency because of the polling frequency. But since we have already entered the eventually consistent area here this might not be such a big concern.
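A minimal sketch of the "published flag plus poller" idea. It uses an in-memory SQLite database as a stand-in for the event store (the question uses MongoDB), and the table names, create_account and relay_unpublished are all invented for the example. The point is only that the entity and its event are written in one local transaction, and a separate relay publishes whatever is still flagged as unpublished.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE accounts (id TEXT PRIMARY KEY, owner TEXT);
    CREATE TABLE events (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
    """
)


def create_account(account_id, owner):
    """Store the entity and its event in ONE local transaction (no 2PC)."""
    with db:  # commits on success, rolls back if an exception is raised
        db.execute("INSERT INTO accounts VALUES (?, ?)", (account_id, owner))
        event = {"type": "AccountCreated", "account_id": account_id}
        db.execute("INSERT INTO events (payload) VALUES (?)", (json.dumps(event),))


def relay_unpublished(publish):
    """Poll for unpublished events, publish each one, then mark it as published."""
    rows = db.execute(
        "SELECT id, payload FROM events WHERE published = 0 ORDER BY id"
    ).fetchall()
    for event_id, payload in rows:
        # If the process dies after publish() but before the UPDATE, the event is
        # published again on the next poll: at-least-once delivery.
        publish(event_id, json.loads(payload))
        with db:
            db.execute("UPDATE events SET published = 1 WHERE id = ?", (event_id,))
    return len(rows)


sent = []  # stands in for the message queue or topic
create_account("acc-1", "alice")
first_poll = relay_unpublished(lambda event_id, event: sent.append((event_id, event)))
second_poll = relay_unpublished(lambda event_id, event: sent.append((event_id, event)))
```

The gap between publishing and marking is where duplicates come from, which is why the consumers have to de-duplicate.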
How about if we have two event stores, and whenever a Domain Event is created, it is queued onto both of them. And the event handler on the query side, handles events popped from both the event stores.
Of course, every event should be idempotent.
But wouldn’t this solve our problem of the event store being a single point of entry?
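For what it's worth, the query-side handler in such a setup would have to look something like the sketch below: every event carries an id, and the handler skips ids it has already applied, so receiving the same event from both stores does no harm. The Projection class and the event fields are hypothetical.

```python
class Projection:
    """Query-side handler that applies each event at most once, keyed by event id."""

    def __init__(self):
        self.applied = set()
        self.balances = {}

    def handle(self, event):
        if event["id"] in self.applied:
            return False  # duplicate delivered by the other store
        self.applied.add(event["id"])
        account = event["account"]
        self.balances[account] = self.balances.get(account, 0) + event["amount"]
        return True


events = [
    {"id": "e1", "account": "acc-1", "amount": 100},
    {"id": "e2", "account": "acc-1", "amount": -30},
]
store_a = list(events)            # both stores received every event
store_b = list(reversed(events))  # possibly in a different order

projection = Projection()
results = [projection.handle(event) for event in store_a + store_b]
```

This handles duplicates, but not ordering: if the two stores deliver events in different orders, handlers whose result depends on order would need something more than an id check.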
Not particularly a MongoDB solution, but have you considered leveraging the Streams feature introduced in Redis 5 to implement a reliable event store? Take a look at this intro here.
I find that it has a rich set of features like message tailing, message acknowledgement, as well as the ability to extract unacknowledged messages easily. This surely helps to implement at-least-once messaging guarantees. It also supports load balancing of messages using the "consumer group" concept, which can help with scaling the processing part.
Regarding your concern about being the single point of failure, as per the documentation, streams and consumer information can be replicated across nodes and persisted to disk (using regular Redis mechanisms I believe). This helps address the single point of failure issue. I'm currently considering using this for one of my microservices projects.
