Preventing data loss in client authoritative database writes - algorithm

A project I'm working on requires users to insert themselves into a list on a server. We expect a few hundred users over a weekend and while very unlikely, a collision could happen in which two users submit the list concurrently and one of them is lost. The server has no validation, it simply allows you to get and put data.
I was pointed in the direction of "optimistic locking" but I'm having trouble grasping when exactly the data should be validated and how it prevents this from happening. If one of the clients reads the data, adds itself and then checks again to ensure that the data is the same with the use of an index or timestamp, how does this prevent the other client from doing the same and then one overwriting the other?
I'm trying to understand the flow in the context of two clients getting data and putting data.

The point of optimistic locking is that the decision to accept or reject a write is taken on the server, and is protected against concurrency by a pessimistic transaction or some sort of hardware protection, such as compare-and-swap. So a client requests a write together with some sort of timestamp or version identifier, and the server only accepts the write if the timestamp is still accurate. If it isn't the client gets some sort of rejection code and will have to try again. If it is, the client gets told that its write succeeded.
This is not the only way to handle receiving data from multiple clients. One popular alternative is to use a reliable messaging system - for example the Java Messaging Service specifies an interface for such systems for which you can find open source implementations. Clients write into the messaging system and can go away as soon as their message is accepted. The server reads requests from the messaging system and acts on them. If the server or the network goes down it's no big deal: the messages will still be there to be read when they come back (typically they are written to disk and have the same level of protection as database data although if you look at a reliable message queue implementation you may find that it is not, in fact, built on top of a standard database table).
One example of a writeup of the details of optimistic locking is the HTTP server Etag specification e.g. https://en.wikipedia.org/wiki/HTTP_ETag

Related

Atomically update database and send message. Outbox pattern or not?

You have a command/operation which means you both need to save something in database end send an event/message to another system. For example you have an OrderService and when a new order is created you want to publish an "OrderCreated"-event for another system/systems to react on (either direct message or using a message broker) and do something.
The easiest (and naive) implementation is to save in db and if successful then send message. But of course this is not bullet proof because the other service/message broker is down or your service crash before sending message.
One (and common?) solution is to implement "outbox pattern", i.e. instead of publish messages directly you save the message to an outbox table in your local database as part of your database transaction (in this example save to outbox table as well as order table) and have a different process (polling db or using change data capture) reading the outbox table and publish messages.
What is your solution to this dilemma, i.e. "update database and send message or do neither"? Note: I am not talking about using SAGAs (could be part of a SAGA though but this is next level).
I have in the past used different approaches:
"Do nothing", i.e just try to send the message and hope it will be sent. Which might be fine in some cases especially with a stable message broker running on same machine.
Using DTC (in my case MSDTC). Beside all the problem with DTC it might not work with your current solution.
Outbox pattern
Using an orchestrator which will retry process if you have not got a "completed" event.
In my current project it is not handled well IMO and I want to change it to be more resilient and self correcting. Sometimes when a service is calling another service and it fails the user might retry and it might work ok. But some operations might require out support to fix it (if it is even discovered).
ATM it is not a Microservice solution but rather two large (legacy) monoliths communicating and is running on same server but moving to a Microservice architecture in the near future and might run on multiple machines.

socket io - Emit an event every X seconds or just emit it after a POST event?

I'm using socket io, and I was wondering what was better.
Emiting an event every X seconds to keep always updated with the database or emit the event after e.g a POST event, so it's more efficient.
I believe updating X seconds should be easier, and maybe has better scalability, but don't know if that's the correct way.
EDIT-1: To give more context. The application is for an accounting team. They basically want their excel sheets converted to a app. They have a lot of data, so I don't know if emitting an event every X seconds is a good idea.
Thanks.
There is no "correct" way. It depends entirely upon the needs of your client and the capabilities of your server. If the client needs to be kept more instantly up-to-date, then send data from your server to the client whenever the server has new data. If the client only needs to be updated every once-in-a-while, then only send it data every once-in-a-while. There is no "correct" way. It depends upon your application.
It is always more efficient to only send data to the client when the data has actually changed and when the client actually cares that something has changed. So, it would be foolish to send a client update every few seconds if the data isn't actually changing that often. If you have a means of knowing when the data changes on the server, then use that event to know when to send data to the client and even then, don't send it more often than the client actually cares to know.
It is always more efficient to have the server do no more work than is actually required by the client. Things like caching and keeping track of what each client was last sent can sometimes save lots of work for the server too.
Any further advice on this matter would need to know a lot more about the needs of your application and how this particular data fits into that and how often the data in question actually changes.
A summary on this topic:
Send data to the client no more often than it needs it
Sending data to the client that has not changed since the last time you changed it is inefficient for the server and consumes bandwidth.
Only you can decide how often your client needs updates (it depends upon your application)
Only you can test the impact on scalability of sending data to every client every time the data changes.
Server-side caching and keeping track of what client already has what data can help you avoid sending data to a client that it already has.
Server-side scalability probably has a lot to do with how many simultaneous clients are connected and how frequently there is changed data to send them.

Sending files over MSMQ

In a retail scenario where each stores report their daily transaction to the backend system at the end of the day. Today a file consisting of the daily transactions and some other meta information is transferred from the stores to the backend using FTP. I’m currently investigating replacing FTP with something else. MSMQ has been suggested as an alternative transport mechanism. So my question is, do we need to write a custom windows service that sticks the daily transactions file into a message object and sends it on its way or is there any out the box mechanism in MSMQ to handle this?
Also, since the files we want to transfer can reach 5-6 Mb for large stores should we rule out MSMQ? In that case is there any other suggested technologies we should investigate?
Cheers!
NServiceBus provides a nice abstraction over MSMQ for situations like this. You get the reliable messaging aspects of MSMQ, along with a very nice programming model for defining your messages.
MSMQ is limited to a 4MB message size, however, and there are two ways you could deal with this in NServiceBus:
NServiceBus has a concept called the Data Bus, which takes the large attachments in your messages and transmits them reliably using another method. This is handled by the infrastructure and as far as your message handlers are concerned, the data is just there.
You could break up the payload into smaller atomic messages and send them as normal messages. The NServiceBus infrastructure would ensure that they all arrive at their destination and are processed. I would recommend this method unless it's absolutely critical that the entire huge data dump is processed as one atomic transaction.
One other thing to note is that the fact that you do nightly dumps is probably a limitation of a previous system. With NServiceBus it may be possible to change the system so that these bits of information are sent in a more immediate fashion, which will result in much more up-to-date data all the time, which may be a big win for the business.
You can look at IBM Sterling Managed File Transfer and WebSphere MQ Managed File Transfer products.
You can consider WebSphere MQ MFT if you require both messaging and file transfer capabilities. On the other hand if your requirement is just file transfer then you can look at Sterling MFT.
Sending files over a messaging transport is not trivial. If you put the entire file into a single message you can have the atomicity you need but tuning the messaging provider for wide variance in message sizes can be challenging. If all the files are of about the same size, one per message is about the simplest solution.
On the other hand, you can split the files into multiple messages but then you've got to reassemble them, in the right order, include a protocol to detect and resend missing segments, integrity-check the file received against the file sent, etc. You also probably want to check that the files on either end did not change during the transmission.
With any of these systems you also need the system to be smart enough to manage the disposition of sending and receiving files under normal and exception conditions, log the transfers, etc.
So when considering whether to move to messaging the two best options are either to move natively to messaging and give up files altogether, or to use an enterprise managed file transfer solution that runs atop the messaging provider that you choose. None of the off-the-shelf MFT products will cost as much in the long run as developing it yourself if you wish to do it right with robust exception handling and reporting.
If the stores are on separate networks and communicating over the internet, then MSMQ is not really an option. NServiceBus provides a concept of a gateway, which allows to asynchronously transport MSMQ messages over HTTP or HTTPS.

Cache values in Java EE

I'm building a simple message delegation application. Messages are being send on both ends via JMS. I'm using a MDB to process incoming messages, transform them and send them to a target queue. Unfortunately the same messages can be send to the incoming queue more than once but it is not allowed to forward duplicates.
So what is the best way to accomplish that?
Since there can be multiple MDBs listening on the incoming queue a need a single cache where I can store the unique message uuids of the incoming messages for at least an hour. How should this cache be accessed? Via a singleton/ static class (I'm running Java EE 5 and thus don't have the singleton annotation)?
In addition I think all operations must be synchronized, right? Does that harm performance too much?
#Ingo: are you OK with database solution. You can full fledged DB server or simple apache derby solution for this..
If so, you can have a simple table where you can store message unique UId and can check against it for uniqueness....this solution will have following benefits:
Simple code
No need of time bound cache(1 hour). You can check for uniqueness of a message forever.
Persistent record of what messages came in.
No need of expensive synchronized, you can rely on DB isolation level to have consistency.
centralized solution for your possibly many deployments of application.

Push or Pull for a near real time automation server?

We are currently developing a server whereby a client requests interest in changes to specific data elements and when that data changes the server pushes the data back to the client. There has vigorous debate at work about whether or not it would be better for the client to poll for this data.
What is considered to be the ideal method, in terms of performance, scalability and network load, of data transfer in a near real time environment?
Update:
Here's a Link that gives some food for thought with regards to UI updates.
There's probably no ideal method for every situation, but push is usually better and used more often. It allows to optimize server caching and data transfers, which helps performance and scalability, and cuts network traffic a bit by avoiding client requests and empty responses. It can be important advantage for a server to operate in it's own pace and supply clients with data when it is ready.
Industry standarts - such as OPC, GID - support both. Server pushes updates to subscribed clients, but client can pull some rarely used data out without bothering with subscription.
As long as the client initiates the connection (to get passed firewall and NAT problems) either way is fine.
If there are several different type of data you need to send, you might want to have the client specify which type he wants, but this is only needed once per connection. Then you can have the server continue to send updates as it has them.
It would be less network traffic to have the server send updates without the client continually asking for updates.
What do you have on the client's side? Many firewalls allow outgoing requests but block incoming requests. In other words, pull may be your only option if you are crossing the Internet unless you are sending out e-mails.

Resources