We're building a microservice system which new data can come from three(or more) different sources and which eventually effects the end user.
It doesn't matter what the purpose of the system for the question so I'll really try to make it simple. Please see the attached diagram.
Data can come from the following sources:
Back-office site: define the system and user configurations.
Main site: where user interact with the site and make actions.
External sources data: such as partners which can gives additional data(supplementary information) about users.
The services are:
Site-back-office service: serve the back-office site.
User-service: serve the main site.
Import service: imports additional data(supplementary information) from external sources.
User cache service: sync with all the above system data and combine them to pre-prepared cache responses. The reason for that is because the main site should serve hundreds of millions of user and should work with very low latency.
The main idea is:
Each microservice has its own db.
Each microservice can scale.
Each data change on one of the three parts effects the user and should be sent to the cache service so it eventually be reflect on the main site.
The cache (Redis) holds all data combined to pre-prepared responses for the main-site.
Each service data change will be published to pubsub topic for the cache-service to update the Redis db.
The system should serve around 200 million of users.
So... the questions are: .
since the User-cache service can(and must) be scale, what happen if, for example, there are two update data messages waiting on pubsub, one is old and one is new. how to process only the new message and prevent the case when one cache-service instance update the new message data to Redis and only after another cache-service instance override it with the old message.
There is also a case when the Cache-service instance need to first read the current cache user data, make the change on it and only then update the cache with the new data. How to prevent the case when two instances for example read the current cache data while a third instance update it with new data and they override it with their data.
Is it at all possible to pre-prepare responses based on several sources which can periodically change?? what is the right approach to this problem?
I'll try to address some of your points, let me know if I misunderstood what you're asking.
1) I believe you're asking about how to enforce ordering of messages, that an old update does not override a newer one. There "publish_time" field of a message (https://cloud.google.com/pubsub/docs/reference/rpc/google.pubsub.v1#google.pubsub.v1.PubsubMessage) to coordinate based on the time the cloud pubsub server received your publish request. If you wish to coordinate based on some other time or ordering mechanism, you can add an attribute to your PubsubMessage or payload to do so.
2) This seems to be a general synchronization problem, not necessarily related to cloud pubsub; I'll leave this to others to answer.
3) Cloud dataflow implements a windowing and watermark mechanism similar to what you're describing. Perhaps you could use this to remove conflicting updates and perform preprocessing prior to writing them to the backing store.
https://beam.apache.org/documentation/programming-guide/#windowing
-Daniel
Does Spring Session management take care of asynchronous calls?
Say that we have multiple controllers and each one is reading/writing different session attributes. Will there be a concurrency issue as the session object is entirely written/read to/from external servers and not the attributes alone?
We are facing such an issue that the attributes set from a controller are not present in the next read... this is an intermittent issue depending on the execution of other controllers in parallel.
When we use the session object from the container we never faced this issue... assuming that it is a direct attribute set/get happening right on to the session object in the memory.
The general use case for the session is storing some user specific data. If I am understanding your context correctly, your issue describes the scenario in which a user, while for example being authenticated from two devices (for example a PC and a phone - hence withing the bounds of the same session) is hitting your backend with requests so fast you face concurrency issues around reading and writing the session data.
This is not a common (and IMHO reasonable) scenario for the session, so projects such as spring-data-redis or spring-data-gemfire won't support it out of the box.
The good news is that spring-session was built with flexibility in mind, so you could of course achieve what you want. You could implement your own version of SessionRepository and manually synchronize (for example via Redis distributed locks) the relevant methods. But, before doing that, check your design and make sure you are using session for the right data storage job.
This question is very similar in nature to your last question. And, you should read my answer to that question before reading my response/comments here.
The previous answer (and insight) posted by the anonymous user is fairly accurate.
Anytime you have a highly concurrent (Web) application/environment where many different, simultaneous HTTP requests are coming in, accessing the same HTTP session, there is always a possibility for lost updates caused by race conditions between competing concurrent HTTP requests. This is due to the very nature of a Servlet container (e.g. Apache Tomcat, or Eclipse Jetty) since each HTTP request is processed by, and in, a separate Thread.
Not only does the HTTP session object provided by the Servlet container need to be Thread-safe, but so too do all the application domain objects that your Web application puts into the HTTP session. So, be mindful of this.
In addition, most HTTP session implementations, such as Apache Tomcat's, or even Spring Session's session implementations backed by different session management providers (e.g. Spring Session Data Redis, or Spring Session Data GemFire) make extensive use of "deltas" to send only the changes (or differences) to the Session state, there by minimizing the chance of lost updates due to race conditions.
For instance, if the HTTP session currently has an attribute key/value of 1/A and HTTP request 1 (processed by Thread 1) reads the HTTP session (with only 1/A) and adds an attribute 2/B, while another concurrent HTTP request 2 (processed by Thread 2) reads the same HTTP session, by session ID (seeing the same initial session state with 1/A), and now wants to add 3/C, then as Web application developers, we expect the end result and HTTP session state to be, after request 1 & 2 in Threads 1 & 2 complete, to include attributes: [1/A, 2/B, 3/C].
However, if 2 (or even more) competing HTTP requests are both modifying say HTTP sessoin attribute 1/A and HTTP request/Thread 1 wants to set the attribute to 1/B and the competing HTTP request/Thread 2 wants to set the same attribute to 1/C then who wins?
Well, it turns out, last 1 wins, or rather, the last Thread to write the HTTP session state wins and the result could either be 1/B or 1/C, which is indeterminate and subject to the vagaries of scheduling, network latency, load, etc, etc. In fact, it is nearly impossible to reason which one will happen, much less always happen.
While our anonymous user provided some context with, say, a user using multiple devices (a Web browser and perhaps a mobile device... smart phone or tablet) concurrently, reproducing this sort of error with a single user, even multiple users would not be impossible, but very improbable.
But, if we think about this in a production context, where you might have, say, several hundred Web application instances, spread across multiple physical machines, or VMs, or container, etc, load balanced by some network load balancer/appliance, and then throw in the fact that many Web applications today are "single page apps", highly sophisticated non-dumb (no longer thin) but thick clients with JavaScript and AJAX calls, then we begin the understand that this scenario is much more likely, especially in a highly loaded Web application; think Amazon or Facebook. Not only many concurrent users, but many concurrent requests by a single user given all the dynamic, asynchronous calls that a Web application can make.
Still, as our anonymous user pointed out, this does not excuse the Web application developer from responsibly designing and coding our Web application.
In general, I would say the HTTP session should only be used to track very minimal (i.e. in quantity) and necessary information to maintain a good user experience and preserve the proper interaction between the user and the application as the user transitions through different parts or phases of the Web app, like tracking preferences or items (in a shopping cart). In general, the HTTP session should not be used to store "transactional" data. To due so is to get yourself into trouble. The HTTP session should be primarily a read heavy data structure (rather than write heavy), particularly because the HTTP session can be and most likely will be accessed from multiple Threads.
Of course, different backing data stores (like Redis, and even GemFire) provide locking mechanisms. GemFire even provides cache level transactions, which is very heavy and arguable not appropriate when processing Web interactions managed in and by an HTTP session object (not to be confused with transactions). Even locking is going to introduce serious contention and latency to the application.
Anyway, all of this is to say that you very much need to be conscious of the interactions and data access patterns, otherwise you will find yourself in hot water, so be careful, always!
Food for thought!
How to use security in SD synchronization without GAM?
I need to block unwanted connections. How can I validate the execution of
Synchronization.Send () and Synchronization.Receive ()
I can not use GAM because I have to adapt my application to a pre existing security system.
There is currently no way for sending additional parameters or HTTP headers in the requests, so you'll need other means to identify your user.
One thing you could do, is call a procedure before synchronizing, passing the relevant information to identify the user (an authorization token or something like that). Then, you should validate that the next call is to the synchronization process, and check for instance that the IP address and the "device id" are the same.
Where would you validate the user's information, depends on which synchronization are we talking about.
For the Receive operation, you may perform your validations in the Offline Database object's Start event.
For the Send operation, everything is saved to the database by using Business Components. So you may add your validations in all the BCs that are involved.
Note: having said all the above, it is highly recommended that you use GeneXus Access Manager (a.k.a. GAM), where all this is already solved.
Second note: you should use HTTPS in all your connections; otherwise, none of this will be secure.
I am looking for a library that will help me keep some state in sync between my server and my GUI in "real time". I have the messaging and middleware sorted (push updates etc), but what I need is a protocol on top of that which guarantees that the data stays in sync within some reasonably finite period - an error / dropped message / exception might cause the data to go out of syn for a few seconds, but it should resync or at least know it is out of sync within a few seconds.
This seems like it should be something that has been solved before but I can't seem to find anything suitable - any help much appreciated
More detail - I have a Rich Client (Silverlight but likely to move to Javascript/C# or Java soon) GUI that is served by a JMS type middleware.
I am looking to re engineer some of the data interactions to something like as follows
Each user has their own view on several reasonably small data sets for items such as:
Entitlements (what GUI elements to display)
GUI data (e.g. to fill drop down menus etc)
Grids of business data (e.g. a grid of orders)
Preferences (e.g. how the GUI is laid out)
All of these data sets can be changed on the server at any time and the data should update on the client as soon as possible.
Data is changed via the server – the client asks for a change (e.g. cancel a request) and the server validates it against entitlements and business rules and updates its internal data set which would then send the change back to the GUI. In order to provide user feedback an interim state may be set on the gui (cancel submitted or similar) which is the over ridden by the server response.
At the moment the workflow is:
User authenticates
GUI downloads the initial data sets from the server (which either loads them from the database or some other business objects it has cached)
GUI renders
GUI downloads a snapshot of the business data
GUI subscribes to updates to the business data
As updates come in the GUI updates the model and view on screen
I am looking for a generalised library that would improve on this
Should be cross language using an efficient payload format (e.g. Java back end, C# front end, protobuf data format)
Should be transport agnostic (we use a JMS style middleware we don’t want to replace right now)
The client should be sent a update when a change occurs to the server side dataset
The client and server should be able to check for changes to ensure they are up to date
The data sent should be minimal (minimum delta)
Client and Server should cope with being more than one revision out of sync
The client should be able to cache to disk in between session and then just get deltas on login.
I think the ideal solution would be used something like
Any object (or object tree) can be registered with the library code (this should work with data/objects loaded via Hibernate)
When the object changes the library notifys a listener / callback with the change delta
The listener sends that delta to the client using my JMS
The client gets the update and can give that back to the client side version of the library which will update the client side version of the object
The client should get sufficient information from the update to be able to decide what UI action needs to be taken (notify user, update grid etc)
The client and server periodically check that they are on the same version of the object (e.g. server sends the version number to the client) and can remediate if necessary by either the server sending deltas or a complete refresh if necessary.
Thanks for any suggestions
Wow, that's a lot!
I have a project going on which deals with the Synchronization aspect of this in Javascipt on the front end. There is a testing server wrote in Node.JS (it actually was easy once the client was was settled).
Basically data is stored by key in a dataset and every individual key is versioned. The Server has all versions of all data and the Client can be fed changes from the server. Version conflicts for when something is modified on both client and server are handled by a conflict resolution callback.
It is not complete, infact it only has in-memory stores at the moment but that will change over the new week or so.
The actual notification/downloading and uploading is out of scope for the library but you could just use Sockets.IO for this.
It currently works with jQuery, Dojo and NodeJS, really it's got hardly any dependencies at all.
The project (with a demo) is located at https://github.com/forbesmyester/SyncIt
We're investigating moving to a distributed cache using Windows AppFabric. Our ASP.NET 4.0 application currently has a cache implementation that uses MemoryCache.
One key feature is that when items are added to the cache, a CacheItemPolicy is included that contains a ChangeMonitor:
CacheItemPolicy policy = new CacheItemPolicy();
policy.Priority = CacheItemPriority.Default;
policy.ChangeMonitors.Add(new LastPublishDateChangeMonitor(key, item, GetLastPublishDateCallBack));
The change monitor internally uses a timer to periodically trigger the delegate passed into it - which is usually a method to get a value from a DB for comparison.
The policy and its change monitor are then included when an item is added to the cache:
Cache.Add(key, item, policy);
An early look at AppFabric's DataCache class seem to indicate whilst a Timespan can be included when adding items to cache, a CacheItemPolicy itself can't be.
Is there an another way to implement the same ChangeMonitor-type functionality in AppFabric. Notifications perhaps?
Cheers
Neil
There are only two hard problems in computer science: cache
invalidation, naming things and off-by-one errors.
Phil Karlton
Unfortunately AppFabric has no support for this sort of monitoring to invalidate a cached item, and similarly no support for things like SqlCacheDependency.
However, AppFabric 1.1 brought in support for read-through and write-behind. Write-behind means that your application updates the cached data first rather than the underlying database, so that the cache always holds the latest version (and therefore the underlying data doesn't need to be monitored); the cache then updates the underlying database asynchronously. To implement read-through/write-behind, you'll need to create an object that inherits from DataCacheStoreProvider (MSDN) and write Read, Write and Delete methods that understand the structure of your database and how to update it.