Comparison with mainstream workflow engines - spring-statemachine

I'd like to use Spring SM in my next feature, which has very simple workflows: 3-4 states, rule-based transitions, and max actors.
The WF is pretty fixed, so storing its definition in java config is quite ok.
I'd prefer to use an SM rather than a WF engine, which comes with the whole machinery, but I couldn't find out whether there is a notion of an Actor.
Meaning, only one particular user (determined by login string) can trigger a transition between states.
Also, can I run the same state machine definition in parallel? Is there a notion of an instance, like a process instance in WF jargon?
Thanks,
Milan

An actor with security is an interesting concept, but we don't have anything built in right now. I'd say this can be accomplished via Spring Security, e.g. https://spring.io/blog/2013/07/04/spring-security-java-config-preview-method-security/ and there's more in its reference doc.
I could try to think about whether there's something we could do to make this easier with Spring Security.
Parallel machines are on my todo list. It is a big topic, so it takes a while to implement. Follow https://github.com/spring-projects/spring-statemachine/issues/35 and other related tickets. That issue is a foundation for building distributed state machines.
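To make the two ideas in this thread concrete (an actor check on a transition, and several independent instances of one fixed definition), here is a minimal, framework-free sketch in Go. It is not Spring Statemachine code and says nothing about its API; every name in it (Definition, Instance, Fire, the Actor field, the logins) is made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// State and Event are plain strings to keep the sketch small.
type State string
type Event string

// Transition is one allowed move plus the only login allowed to trigger it.
// The Actor field is a hypothetical way to model "who may fire this".
type Transition struct {
	From  State
	Event Event
	To    State
	Actor string
}

// Definition is the fixed workflow, shared read-only by all instances.
type Definition []Transition

// Instance is one running workflow, comparable to a process instance.
type Instance struct {
	def     Definition
	Current State
}

// ErrNotAllowed is returned for unknown transitions and for wrong actors.
var ErrNotAllowed = errors.New("transition not allowed")

// NewInstance starts an independent instance of the same definition.
func NewInstance(def Definition, start State) *Instance {
	return &Instance{def: def, Current: start}
}

// Fire applies the event only if a matching transition exists for the
// current state and the given user is the actor of that transition.
func (i *Instance) Fire(user string, ev Event) error {
	for _, t := range i.def {
		if t.From == i.Current && t.Event == ev {
			if t.Actor != user {
				return ErrNotAllowed
			}
			i.Current = t.To
			return nil
		}
	}
	return ErrNotAllowed
}

func main() {
	def := Definition{
		{From: "DRAFT", Event: "submit", To: "REVIEW", Actor: "milan"},
		{From: "REVIEW", Event: "approve", To: "DONE", Actor: "boss"},
	}
	a, b := NewInstance(def, "DRAFT"), NewInstance(def, "DRAFT")
	fmt.Println(a.Fire("boss", "submit"), a.Current)  // wrong actor, state unchanged
	fmt.Println(a.Fire("milan", "submit"), a.Current) // right actor, state advances
	fmt.Println(b.Current)                            // the second instance is unaffected
}
```

In a Spring application the actor check would more likely live in a guard or in method security, as suggested above; the sketch only shows the shape of the check.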

Related

(Golang) Clean Architecture - Who should do the orchestration?

I am trying to understand which of the following two options is the right approach and why.
Say we have a GetHotelInfo(hotel_id) API that is invoked from the Web and reaches the Controller.
The logic of the GetHotelInfo is:
Invoke GetHotelPropertyData() (Location, facilities…)
Invoke GetHotelPrice(hotel_id, dates…)
Invoke GetHotelReviews(hotel_id)
Once all results come back, process and merge the data and return 1 object that contains all relevant data of the hotel.
Option 1:
Create 3 different repositories (HotelPropertyRepo, HotelPriceRepo,
HotelReviewsRepo)
Create GetHotelInfo usecase that will use these 3 repositories and
return the final result.
Option 2:
Create 3 different repositories (HotelPropertyRepo, HotelPriceRepo,
HotelReviewsRepo)
Create 3 different usecases (GetHotelPropertyDataUseCase,
GetHotelPriceUseCase, GetHotelReviewsUseCase)
Create GetHotelInfoUseCase that will orchestrate the previous 3
usecases. (It can also be a controller, but that’s a different topic)
Let’s say that right now only GetHotelInfo is being exposed to the Web but maybe in the future, I will expose some of the inner requests as well.
And would the answer be different if the actual logic of GetHotelInfo is not a combination of 3 endpoints but rather 10?
You can see a similar method (called Get()) in "Clean Architecture with GO" by Manato Kuroda.
Manato points out that:
following the Acyclic Dependencies Principle (ADP), the dependencies only point inward in the circle, never outward, and there are no cycles.
that the Controller and Presenter depend on the Use Case Input Port and Output Port, which are defined as interfaces, not as specific logic (the details). This is possible (without knowing the details in the outer layer) thanks to the Dependency Inversion Principle (DIP).
That is why, in the example repository manakuro/golang-clean-architecture, Manato creates three directories for the use cases layer:
repository,
presenter: in charge of Output Port
interactor: in charge of Input Port, with a set of methods of specific application business rules, depending on repository and presenter interface.
You can adapt that example to your case, with GetHotelInfo declared first in a hotel_interactor.go file, depending on the specific business methods declared in hotel_repository and on the responses defined in hotel_presenter.
Interactors (use case classes) are expected to call other interactors, so both approaches follow Clean Architecture principles.
But the "maybe in the future" phrase goes against good design and architecture practices.
We can and should think as abstractly as possible so that we favor reuse, while always keeping things simple and avoiding unnecessary complexity.
And would the answer be different if the actual logic of GetHotelInfo is not a combination of 3 endpoints but rather 10?
No, it would be the same. However, since you are designing APIs, if you ever need to combine dozens of endpoints, you should consider introducing a GraphQL layer instead of adding complexity to the project.
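As a rough illustration of Option 1 from the question, here is a minimal Go sketch of one use case that depends on three repository interfaces and merges their results. The method names, the result fields, and the fakeRepo stub are assumptions made for the example (the dates parameter of GetHotelPrice is left out for brevity); they are not taken from manakuro/golang-clean-architecture.

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified result types; the fields are placeholders.
type PropertyData struct{ Location string }
type Price struct{ PerNight float64 }
type Review struct{ Text string }

// HotelInfo is the single merged object returned to the controller.
type HotelInfo struct {
	Property PropertyData
	Price    Price
	Reviews  []Review
}

// The repository interfaces are declared next to the use case (inner layer),
// so concrete implementations depend inward on them.
type HotelPropertyRepo interface {
	GetPropertyData(hotelID string) (PropertyData, error)
}
type HotelPriceRepo interface {
	GetPrice(hotelID string) (Price, error)
}
type HotelReviewsRepo interface {
	GetReviews(hotelID string) ([]Review, error)
}

// GetHotelInfoUseCase orchestrates the three repositories.
type GetHotelInfoUseCase struct {
	Properties HotelPropertyRepo
	Prices     HotelPriceRepo
	Reviews    HotelReviewsRepo
}

// Execute calls the repositories concurrently and merges the results.
func (uc GetHotelInfoUseCase) Execute(hotelID string) (HotelInfo, error) {
	var (
		info HotelInfo
		errs [3]error
		wg   sync.WaitGroup
	)
	wg.Add(3)
	// Each goroutine writes only its own field and error slot, so no lock is needed.
	go func() { defer wg.Done(); info.Property, errs[0] = uc.Properties.GetPropertyData(hotelID) }()
	go func() { defer wg.Done(); info.Price, errs[1] = uc.Prices.GetPrice(hotelID) }()
	go func() { defer wg.Done(); info.Reviews, errs[2] = uc.Reviews.GetReviews(hotelID) }()
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return HotelInfo{}, err
		}
	}
	return info, nil
}

// fakeRepo is an in-memory stand-in that satisfies all three interfaces.
type fakeRepo struct{}

func (fakeRepo) GetPropertyData(string) (PropertyData, error) {
	return PropertyData{Location: "Lisbon"}, nil
}
func (fakeRepo) GetPrice(string) (Price, error) { return Price{PerNight: 120}, nil }
func (fakeRepo) GetReviews(string) ([]Review, error) {
	return []Review{{Text: "Great stay"}}, nil
}

func main() {
	repo := fakeRepo{}
	uc := GetHotelInfoUseCase{Properties: repo, Prices: repo, Reviews: repo}
	info, err := uc.Execute("hotel-1")
	fmt.Println(info, err)
}
```

Moving to Option 2 later would mostly mean wrapping each repository call in its own small use case and having GetHotelInfoUseCase depend on those instead.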
Clean is not a well-defined term. Rather, you should be aiming to minimise the impact of change (adding or removing a service). And by "impact" I mean not only the cost and time factors but also the risk of introducing a regression (breaking a different part of the system that you're not meant to be touching).
To minimise the "impact of change" you would split these into separate services/bounded contexts and allow interaction only through events. The 'controller' would raise an event (on a shared bus) like 'hotel info request', and each separate service (property, price, and reviews) would respond independently and asynchronously (maybe on the same bus), leaving the controller to aggregate the results and return them to the client, which could be done after some period of time. If you code the result aggregator appropriately it would be possible to add new 'features' or remove existing ones completely independently of the others.
To improve on this you would then separate the read and write functionality of each context into its own context, each responding to appropriate events. This will allow you to optimise and scale the write function independently of the read function. We call this CQRS.
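The event-based aggregation described above can be approximated in a single process with goroutines and a channel standing in for the shared bus. This is only a sketch under that assumption: Service, Result, Aggregate, and the three example services are hypothetical names, and a real system would use an actual message bus between separate services.

```go
package main

import (
	"fmt"
	"time"
)

// Result is what a service publishes in response to a request event.
type Result struct {
	Source string
	Data   string
}

// Service reacts to a "hotel info request" event for the given hotel ID.
type Service func(hotelID string) Result

// Aggregate sends the request to every service and merges whatever replies
// arrive before the timeout. Slow or removed services are simply absent from
// the result, so services can be added or dropped independently.
func Aggregate(hotelID string, services []Service, timeout time.Duration) map[string]string {
	replies := make(chan Result, len(services)) // buffered: late repliers never block
	for _, s := range services {
		go func(s Service) { replies <- s(hotelID) }(s)
	}
	merged := make(map[string]string)
	deadline := time.After(timeout)
	for range services {
		select {
		case r := <-replies:
			merged[r.Source] = r.Data
		case <-deadline:
			return merged
		}
	}
	return merged
}

func property(id string) Result { return Result{Source: "property", Data: "location of " + id} }
func price(id string) Result    { return Result{Source: "price", Data: "price of " + id} }

// slowReviews simulates a service that answers after 200ms.
func slowReviews(id string) Result {
	time.Sleep(200 * time.Millisecond)
	return Result{Source: "reviews", Data: "reviews of " + id}
}

func main() {
	services := []Service{property, price, slowReviews}
	fmt.Println(Aggregate("hotel-1", services, 50*time.Millisecond)) // reviews miss the deadline
	fmt.Println(Aggregate("hotel-1", services, time.Second))         // all three arrive in time
}
```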

Scheduling tasks/messages for later processing/delivery

I'm creating a new service whose database entries (Mongo) have a state field that I need to update based on the current time. For instance, if the start time was set to two hours from now, I need to change the state from CREATED -> STARTED in the database at that time, and there can be multiple such states.
Approaches I've thought of:
Keep querying database entries that are <= current time and then change their states accordingly. This causes extra reads for no reason and half the time empty reads, and it will get complicated fast with more states coming in.
Write a job scheduler (I am using Go, so that wouldn't be so hard) and schedule all the jobs, but I might lose queue data in case of a panic/crash.
Use a product like Celery; I have found a Go implementation of it: https://github.com/gocelery/gocelery
Another task scheduler I've found is on Google Cloud https://cloud.google.com/solutions/reliable-task-scheduling-compute-engine, but I don't want to get stuck in proprietary technologies.
I wanted to use some PubSub service for this, but I couldn't find one that has delayed messages (if that's a thing). My problem is mainly not being able to find an actual name for this problem, to be able to search for it properly, I've even tried searching Microsoft docs. If someone can point me in the right direction or if any of the approaches I've written are the ones I should use, please let me know, that would be a great help!
UPDATE:
Found one more solution by Netflix, for the same problem
https://medium.com/netflix-techblog/distributed-delay-queues-based-on-dynomite-6b31eca37fbc
I think you are right in that the problem you are trying to solve is the job or task scheduling problem.
One approach that many companies use is the system you are proposing: jobs are inserted into a datastore with a time to execute at and then that datastore can be polled for jobs to be run. There are optimizations that prevent extra reads like polling the database at a regular interval and using exponential back-off. The advantage of this system is that it is tolerant to node failure and the disadvantage is added complexity to the system.
Looking around, in addition to the one you linked (https://github.com/gocelery/gocelery) there are other implementations of this model (https://github.com/ajvb/kala or https://github.com/rakanalh/scheduler were ones I found after a quick search).
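For reference, the datastore-polling model can be sketched in a few lines of Go. An in-memory map stands in for the Mongo collection here, and Store, Add, StartDue, and Poll are names invented for the example; against a real database, StartDue would be one indexed query on state and start time followed by an update.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Job mirrors a database entry with a state field and a start time.
type Job struct {
	ID      string
	State   string // "CREATED" or "STARTED"
	StartAt time.Time
}

// Store stands in for the database collection.
type Store struct {
	mu   sync.Mutex
	jobs map[string]*Job
}

func NewStore() *Store { return &Store{jobs: make(map[string]*Job)} }

// Add inserts a CREATED job that becomes due after the given delay.
func (s *Store) Add(id string, delay time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.jobs[id] = &Job{ID: id, State: "CREATED", StartAt: time.Now().Add(delay)}
}

// StartDue moves every CREATED job whose start time is not after now to
// STARTED and returns the IDs it changed. Calling it again is harmless.
func (s *Store) StartDue(now time.Time) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	var started []string
	for id, j := range s.jobs {
		if j.State == "CREATED" && !j.StartAt.After(now) {
			j.State = "STARTED"
			started = append(started, id)
		}
	}
	return started
}

// Poll runs StartDue on every tick until stop is closed.
func Poll(s *Store, every time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case now := <-ticker.C:
			s.StartDue(now)
		case <-stop:
			return
		}
	}
}

func main() {
	s := NewStore()
	s.Add("a", 0)         // due immediately
	s.Add("b", time.Hour) // due in an hour
	fmt.Println(s.StartDue(time.Now()))
}
```

Because the state lives in the store rather than in the process, a crashed poller can be restarted and will pick up every job that became due while it was down.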
The other approach you described, "schedule jobs in process", is very simple in Go because parked goroutines are extremely cheap: you just spawn one goroutine per job. The downside is that if the process dies, the job is lost.
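go func() {
	// Park this goroutine until the job is due.
	// time.Until(t) is shorthand for t.Sub(time.Now()) and needs the "time" import.
	<-time.After(time.Until(expirationTime))
	// do work here.
}()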
A final approach that I have seen but wouldn't recommend is the callback model (something like https://gitlab.com/andreynech/dsched). This is where your service calls to another service (over http, grpc, etc.) and schedules a callback for a specific time. The advantage is that if you have multiple services in different languages, they can use the same scheduler.
Overall, before you decide on a solution, I would consider some trade-offs:
How acceptable is job loss? If it's ok that some jobs are lost a small percentage of the time, maybe an in-process solution is acceptable.
How long will jobs be waiting? If it's longer than the shutdown period of your host, maybe a datastore based solution is better.
Will you need to distribute job load across multiple machines? If you need to distribute the load, sharding and scheduling are tricky things and you might want to consider using a more off-the-shelf solution.
Good luck! Hope that helps.

Microservices: model sharing between bounded contexts

I am currently building a microservices-based application developed with the MEAN stack and am running into several situations where I need to share models between bounded contexts.
As an example, I have a User service that handles the registration process as well as login (generating a JWT), logout, etc. I also have a File service which handles the uploading of profile pics and other images the user happens to upload. Additionally, I have a Friends service that keeps track of the associations between members.
Currently, I am adding the guid of the user from the user table used by the User service, as well as the first, middle and last name fields, to the File table and the Friend table. This way I can query for these fields whenever I need them in the other services (Friend and File) without needing to make any REST calls to get the information every time it is queried.
Here is the caveat:
The downside seems to be that I have to notify the File and Friend tables (I chose Seneca with RabbitMQ for this) whenever a user updates their information in the User table.
1) Should I be worried about the services getting too chatty?
2) Could this lead to any performance issues if a lot of updates take place over an hour, let's say?
3) In trying to isolate boundaries, I am just not seeing another way of pulling this off. What is the recommended approach to solving this issue, and am I on the right track?
It's a trade-off. I would personally not store the user details alongside the user identifier in the dependent services. But neither would I query the users service to get this information. What you probably need is some kind of read-model for the system as a whole, which can store this data in a way which is optimized for your particular needs (reporting, displaying together on a webpage etc).
The read-model is a pattern which is popular in the event-driven architecture space. There is a really good article that talks about these kinds of questions (in two parts):
https://www.infoq.com/articles/microservices-aggregates-events-cqrs-part-1-richardson
https://www.infoq.com/articles/microservices-aggregates-events-cqrs-part-2-richardson
Many common questions about microservices seem to be largely around the decomposition of a domain model, and how to overcome situations where requirements such as querying resist that decomposition. This article spells the options out clearly. Definitely worth the time to read.
In your specific case, it would mean that the File and Friends services would only need to store the primary key for the user. However, all services should publish state changes which can then be aggregated into a read-model.
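A minimal sketch of such a read-model in Go, assuming each service publishes small state-change events: the event types (UserNameChanged, FileUploaded), the ProfileView shape, and the in-memory map are all invented for illustration, and delivery over a broker is left out.

```go
package main

import "fmt"

// UserNameChanged is published by the User service (hypothetical event).
type UserNameChanged struct{ UserID, FullName string }

// FileUploaded is published by the File service; it carries only the user's key.
type FileUploaded struct{ UserID, FileID string }

// ProfileView is one row of the read-model, shaped for display.
type ProfileView struct {
	FullName string
	FileIDs  []string
}

// ReadModel aggregates events from all services into query-friendly rows.
type ReadModel struct{ profiles map[string]*ProfileView }

func NewReadModel() *ReadModel {
	return &ReadModel{profiles: make(map[string]*ProfileView)}
}

// row returns the view for a user, creating it on first use so that events
// may arrive in any order.
func (m *ReadModel) row(userID string) *ProfileView {
	p, ok := m.profiles[userID]
	if !ok {
		p = &ProfileView{}
		m.profiles[userID] = p
	}
	return p
}

// Apply folds one event into the read-model; unknown events are ignored.
func (m *ReadModel) Apply(event interface{}) {
	switch e := event.(type) {
	case UserNameChanged:
		m.row(e.UserID).FullName = e.FullName
	case FileUploaded:
		p := m.row(e.UserID)
		p.FileIDs = append(p.FileIDs, e.FileID)
	}
}

// Profile answers the combined query without calling any service.
func (m *ReadModel) Profile(userID string) (ProfileView, bool) {
	p, ok := m.profiles[userID]
	if !ok {
		return ProfileView{}, false
	}
	return *p, true
}

func main() {
	m := NewReadModel()
	m.Apply(FileUploaded{UserID: "u1", FileID: "avatar.png"})
	m.Apply(UserNameChanged{UserID: "u1", FullName: "Ada Lovelace"})
	fmt.Println(m.Profile("u1"))
}
```

With this shape, a name change is written once in the User service and once in the read-model, instead of being copied into every dependent service's tables.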
If you are worried about a high volume of messages and high TPS, for example 100,000 TPS for producing and consuming events, I suggest using Apache Kafka or NATS instead of RabbitMQ (the Go version of NATS, since there is also a Ruby version) in order to support a high volume of messages per second.
Also, regarding database design, you should design each microservice around business capabilities and a bounded context, according to domain-driven design (DDD). Unlike in SOA, it is suggested that each microservice has its own database, so you should not worry about normalization: you may have to repeat many structures, fields, tables, and features in each microservice to keep them decoupled from each other and let them work independently, which raises availability and scalability.
You can also use event sourcing + CQRS or transaction log tailing to exchange events between your microservices and manipulate state with eventual consistency (in the sense of the CAP theorem). This circumvents 2PC (two-phase commit), which is not recommended when implementing microservices.

Eventual Consistency in microservice-based architecture temporarily limits functionality

I'll illustrate my question with Twitter. For example, Twitter has a microservice-based architecture, which means that different processes run on different servers and have different databases.
A new tweet appears: server A stores some data in its own database, generates new events, and fires them. Servers B and C haven't received these events at this point, so they haven't stored anything in their databases or processed anything.
The user that created the tweet wants to edit that tweet. To achieve that, all three services A, B, C should have processed all events and stored all required data in their databases, but services B and C aren't consistent yet. That means that we are not able to provide edit functionality at the moment.
As far as I can see, a possible workaround could be switching to immediate consistency, but that would take away all the benefits of a microservice-based architecture and could probably cause problems with tight coupling.
Another workaround is to restrict the user's actions for some time until the data is consistent across all necessary services. That is probably a solution, depending on the customer and their business requirements.
Yet another workaround is to add additional logic, or probably a service D, that stores edits as user actions and applies them to the data only once it is consistent. The drawback is greatly increased system complexity.
And there are two-phase commits, but that's 1) not really reliable 2) slow.
I think slowness is a huge drawback in case of such loads as Twitter has. But probably it could be solved, whereas lack of reliability cannot, again, without increased complexity of a solution.
So, the questions are:
Are there any nice solutions to the illustrated situation or only things that I mentioned as workarounds? Maybe some programming platforms or databases?
Did I misunderstand something, and are some of the workarounds incorrect?
Is there any approach other than eventual consistency that will guarantee that all data will be stored and all necessary actions will be executed by other services?
Why has eventual consistency been picked for this use case? As far as I can see, right now it is the only way to guarantee that some data will be stored or some action will be performed in an event-driven approach, where some services start their work when an event is fired; in my example, that event would be “tweet is created”. So, if services B and C go down, I need to be able to perform the action successfully when they are up again.
Things I would like to achieve are: reliability, ability to bear high loads, adequate complexity of solution. Any links on any related subjects will be very much appreciated.
If there are natural limitations of this approach and what I want cannot be achieved using this paradigm, it is okay too. I just need to know that this problem really isn't solved yet.
It is all about trade-offs. With eventual consistency, in your example it may mean that the user cannot edit for maybe a few seconds, since most eventually consistent technologies do not take long to replicate data across nodes. So in this use case it is absolutely acceptable, since users are pretty slow in their actions.
For example:
MongoDB is consistent by default: reads and writes are issued to the
primary member of a replica set. Applications can optionally read from
secondary replicas, where data is eventually consistent by default.
from official MongoDB FAQ
Another alternative that is getting more popular is to use a streaming platform such as Apache Kafka where it is up to your architecture design how fast the stream consumer will process the data (for eventual consistency). Since the stream platform is very fast it is mostly only up to the speed of your stream processor to make the data available at the right place. So we are talking about milliseconds and not even seconds in most cases.
The key thing in these sorts of architectures is to have each service be autonomous when it comes to writes: it can take the write even if none of the other application-level services are up.
So in the example of a Twitter-like service, you would model it as
Service A manages the content of a post
So when a user makes a post, a write happens in Service A's DB and from that instant the post can be edited because editing is just a request to A.
If there's some other service that consumes the "post content" change events from A and after a "new post" event exposes some functionality, that functionality isn't going to be exposed until that service sees the event (yay tautologies). But that's just physics: the sun could have gone supernova five minutes ago and we can't take any action (not that we could have) until we "see the light".
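A toy Go sketch of that model, with an in-memory map in place of Service A's DB and a slice in place of the event stream. PostService, SearchIndex, and the outbox are hypothetical names chosen for the example; the point is only that Edit touches nothing outside Service A.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is a "post content" change published by Service A.
type Event struct{ Kind, PostID, Content string }

// PostService is Service A: it owns post content and accepts writes even
// when no other service is running.
type PostService struct {
	posts  map[string]string
	outbox []Event // events not yet delivered to other services
}

func NewPostService() *PostService {
	return &PostService{posts: make(map[string]string)}
}

// Create writes to A's own store and records an event for later delivery.
func (s *PostService) Create(id, content string) {
	s.posts[id] = content
	s.outbox = append(s.outbox, Event{Kind: "created", PostID: id, Content: content})
}

// Edit only needs A's own data, so it works as soon as Create has returned.
func (s *PostService) Edit(id, content string) error {
	if _, ok := s.posts[id]; !ok {
		return errors.New("post not found")
	}
	s.posts[id] = content
	s.outbox = append(s.outbox, Event{Kind: "edited", PostID: id, Content: content})
	return nil
}

// Drain hands over the pending events and empties the outbox.
func (s *PostService) Drain() []Event {
	pending := s.outbox
	s.outbox = nil
	return pending
}

// SearchIndex is a downstream service that learns about posts only via events.
type SearchIndex struct{ docs map[string]string }

func NewSearchIndex() *SearchIndex { return &SearchIndex{docs: make(map[string]string)} }

// Consume applies events in order; the last content for a post wins.
func (ix *SearchIndex) Consume(events []Event) {
	for _, e := range events {
		ix.docs[e.PostID] = e.Content
	}
}

func (ix *SearchIndex) Lookup(id string) (string, bool) {
	c, ok := ix.docs[id]
	return c, ok
}

func main() {
	a, index := NewPostService(), NewSearchIndex()
	a.Create("p1", "helo world")
	fmt.Println(a.Edit("p1", "hello world")) // succeeds before the index has seen anything
	_, found := index.Lookup("p1")
	fmt.Println(found) // the index has not caught up yet
	index.Consume(a.Drain())
	fmt.Println(index.Lookup("p1")) // now it has, including the edit
}
```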

ZeroMQ to send messages between systems

I am very much new to the ZeroMQ library.
Hence I wanted to know the pattern ( REQ-REP, PUSH-PULL, PUB-SUB ) that will be the best for our application.
The application which we are using has two systems,
the one which the user interacts with
and
the second is the scheduler, which executes a job, scheduled by the user in the first system.
Now I want to make use of ZeroMQ to send messages in the below scenarios:
from userSystem to schedulerSystem that a job with particular job id is submitted for execution.
from schedulerSystem to userSystem that the job sent with a particular job id has been executed successfully or that the execution has failed
Can somebody please help with this,
stating the reason for using a particular pattern?
Thanks in advance.
Which is the best Formal Communication Pattern to use? None...
Dear Ann, with all due respect, nobody would seriously try to answer the question of which of all the possible phone numbers is the best for any kind of use.
Why? There is simply no Swiss-Army-Knife for doing just anything.
That is surprisingly the good news.
As a system designer, one may create The Right Solution on a green field, using just-enough design strategies so as not to do more than necessary ( overhead-wise ) and to have all the pluses on your design side ( scalability-wise, low-latency-wise, memory-footprint-wise, etc. )
If no other requirements than (1) and (2) above appear, a light-weight scheme like this may work fine as an MVP "just-enough" design:
If userSystem does not process anything depending on a schedulerSystem output value, a PUSH-PULL might be an option for sending a job, with possible extensions.
For userSystem receiving independent, asynchronously organised state-reporting messages about respective jobID return code(s), again a receiver side poll-ed PUSH-PULL might work well.
Why? The otherwise natural choice, the behaviourally unstructured PAIR-PAIR, prevents your processing from growing in scale once performance, architecture, or both demand it. PAIR-PAIR does not allow your communication framework to join more entities together, while the other patterns do, so your processing power can go distributed as far as your IP-visibility and end-to-end latency permit.
The real world is typically much more complex
Just one picture, Fig.60 from the below-mentioned book:
The best next step?
To see a bigger picture on this subject >>> with more arguments, a simple signalling-plane picture and a direct link to a must-read book from Pieter HINTJENS.

Resources