In SOA practice, what strategies work well (or work at all) for updating long-running processes, in particular for Oracle BPEL? For example, a process may involve several human steps, which by their nature are time consuming. SOA suites support starting new instances on the new version of a process while already-running instances continue executing on the old one. But what do you do if the orchestration logic needs to be updated and applied to already running instances? Let's assume we no longer want purchase orders to pass management approval, and would like this change to be applied to all orders, even those currently being executed.
You cannot change the business process for anything which is in flight. Changes can only be applied to new processes. This is not a technical limitation, it is just common sense. Apart from anything else, it would confuse audit trails and regulatory compliance.
If you have so catastrophically mis-designed a process - "we forgot to include management approval for orders!" *facepalm* - all you can do is shut off the server and clean up any half-completed processes. But that would be a really drastic step to take.
So the only strategy which is going to work is review and acceptance testing.
Chaos engineering practices are becoming very widely used. One common example is Netflix's own Chaos Monkey. However, Chaos Monkey is often run ad hoc against random targets. I'm curious how chaos experiments might work in a typical CI/CD pipeline to enhance a specific service's resiliency.
Since chaos experiments (usually) require a fully functional environment, when would they run? Would they run in parallel to testing, or downstream?
Would you run a chaos experiment with every commit, or just some?
How long would you allow the chaos experiments to run? A 60-minute CPU spike might interfere with a "fail fast" approach, for example.
Would a chaos experiment ever fail the pipeline? What would constitute a 'failure'?
We are just getting started with our chaos engineering efforts, but I'll offer some thoughts regarding your questions.
There are at least three distinct classes of experiment:
Instance/container kills that we expect the underlying infrastructure to handle automatically.
Higher-level but fairly localized failures like slow or unavailable dependencies.
Large-scale failures like data center or region down.
For a build pipeline the sweet spot would be in the middle there (i.e. higher-level but localized failures), because usually the software itself plays a role in responding to the failure. For example the software might include a circuit breaker that trips, throttling, automated failover, etc. If those are software functions, then they can either work or not work, and the build should uncover that.
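To make that concrete, here is a minimal, hand-rolled circuit breaker sketch (not any particular library's API; the class name and threshold are illustrative) of the kind of behavior a pipeline chaos stage could assert against: inject failures into a dependency and fail the build if the breaker never opens.

```java
import java.util.function.Supplier;

// Hypothetical, minimal circuit breaker: trips to OPEN after a threshold of
// consecutive failures, so a chaos test can assert that the breaker actually
// opens when a dependency is made slow or unavailable.
public class CircuitBreaker {
    private final int failureThreshold;
    private int consecutiveFailures = 0;
    private boolean open = false;

    public CircuitBreaker(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    public synchronized <T> T call(Supplier<T> dependency, Supplier<T> fallback) {
        if (open) {
            return fallback.get();               // short-circuit: dependency is not called
        }
        try {
            T result = dependency.get();
            consecutiveFailures = 0;             // success resets the failure count
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                open = true;                     // trip the breaker
            }
            return fallback.get();
        }
    }

    public synchronized boolean isOpen() {
        return open;
    }
}
```

A chaos stage could wrap calls to the deliberately broken dependency in a breaker like this and fail the build if isOpen() never becomes true.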
To the extent that resiliency to failure is a system requirement, then yes, a failed experiment would fail the pipeline. Suppose for instance that build 392 has a correctly working circuit breaker, and that build 393 doesn't. That would be a failure, since the build goes from meeting the requirement to not meeting it.
We usually keep some chaos experiments, like large-scale failures, outside the pipeline.
During the build pipeline, we usually combine chaos experiments with a short performance test to simulate activity, then kill some instances/containers to check the resilience of the system, and fail the build if the system is not able to recover.
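As a rough illustration of what such a pipeline stage might look like, here is a hedged Java sketch. It assumes a hypothetical /admin/kill-one-instance endpoint in the test environment and a /health endpoint (neither is a real API), assumes the short performance test runs as a separate step, and simply fails the stage if the system does not report healthy again within a fixed budget.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.time.Instant;

public class KillInstanceExperiment {
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        String baseUrl = args.length > 0 ? args[0] : "http://test-env.example.internal";

        // 1. Inject the failure (this admin endpoint is a placeholder, not a real API).
        send(HttpRequest.newBuilder(URI.create(baseUrl + "/admin/kill-one-instance"))
                .POST(HttpRequest.BodyPublishers.noBody()).build());

        // 2. Poll the health endpoint; recovery must happen inside the budget,
        //    otherwise the experiment (and the pipeline stage) fails.
        Duration budget = Duration.ofMinutes(5);
        Instant deadline = Instant.now().plus(budget);
        while (Instant.now().isBefore(deadline)) {
            try {
                HttpResponse<String> health =
                        send(HttpRequest.newBuilder(URI.create(baseUrl + "/health")).GET().build());
                if (health.statusCode() == 200) {
                    System.out.println("System recovered, experiment passed.");
                    return;
                }
            } catch (Exception e) {
                // connection failures while the system is recovering are expected
            }
            Thread.sleep(5_000);
        }
        throw new IllegalStateException("System did not recover within " + budget + "; failing the build.");
    }

    private static HttpResponse<String> send(HttpRequest request) throws Exception {
        return HTTP.send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```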
I'm trying to get to grips with Service Fabric and I'm struggling a little bit. Some questions:
Are all Service Fabric service instances single-threaded? I created a stateless Web API, one instance, with a method that did a Task.Delay, then returned a string. Two requests to this service were served one after the other, not concurrently. So am I right in thinking that the number of concurrent requests that can be served is purely a function of the service instance count in the application manifest? Edit: Thinking about this, it is probably to do with the setup of the OWIN Web API. Could it be blocking by session? I assumed there is no session by default.
I have long-running operations that I need to perform in Service Fabric (they can take several hours). Is there a recommended pattern that I can use for this in Service Fabric? These are currently handled using a storage queue that triggers a webjob. Maybe something with Reliable Queues and a RunAsync loop?
It seems you handled the first part, so I will comment on the second part: long-running operations.
Long-running operations and workflows were being handled far before Service Fabric came about. For this reason, we can build on the shoulders of giants by looking at the design patterns that software experts have been using for decades, for example the famous and all-inclusive Process Manager. Mind you, this pattern is sometimes overkill; if it is in your case, just check out the rest of the related patterns in the Enterprise Integration Patterns book (by Gregor Hohpe).
As for reliable collections, those are an implementation detail: a data structure chosen to support the chosen design pattern.
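To illustrate the idea, here is a deliberately small process-manager sketch. The class, states, and events are made up for illustration; in Service Fabric the in-memory map could be replaced by a reliable collection so the state survives failover.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class OrderProcessManager {

    // illustrative stages of a long-running order process
    enum State { AWAITING_PAYMENT, AWAITING_SHIPMENT, COMPLETED }

    // one entry per running process instance, keyed by a correlation id;
    // in Service Fabric this map could be a reliable dictionary instead
    private final Map<String, State> instances = new ConcurrentHashMap<>();

    public void start(String orderId) {
        instances.put(orderId, State.AWAITING_PAYMENT);
        // kick off the first activity here, e.g. enqueue a payment request
    }

    // called whenever a message arrives for this process instance
    public void onMessage(String orderId, String event) {
        State current = instances.get(orderId);
        if (current == null) {
            return; // unknown or already finished instance
        }
        switch (current) {
            case AWAITING_PAYMENT:
                if ("PAYMENT_RECEIVED".equals(event)) {
                    instances.put(orderId, State.AWAITING_SHIPMENT);
                    // trigger the shipment activity here
                }
                break;
            case AWAITING_SHIPMENT:
                if ("SHIPMENT_CONFIRMED".equals(event)) {
                    instances.put(orderId, State.COMPLETED);
                    // trigger any follow-up processing here
                }
                break;
            case COMPLETED:
                break; // nothing left to do
        }
    }
}
```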
I hope that helps
With regard to your second point - it really depends on the nature of your long-running task.
Is your long-running task the kind of workload that runs on an isolated thread, depends on local OS/VM-level resources, and eventually comes back with a result (A)? Or is it the kind of long-running task that goes through stages and builds up a model of the result through a series of persisted state changes (B)?
From what I understand of Service Fabric, it isn't really designed for running long running workloads (A), but more for writing horizontally-scalable, highly-available systems.
If you were absolutely keen on using Service Fabric (and your kind of workload tends to be more like B than A), I would definitely find a way to break down those long-running tasks so they can be processed in parallel across the cluster. But even then, there are probably more appropriate technologies designed for this, such as Azure Batch.
P.s. If you are going to put a long-running process in the RunAsync method, you should design the workload so it is interruptible and its state can be persisted in a way that can be resumed from another node in the cluster (see the sketch at the end of this answer).
In a stateful service, only the primary replica has write access to state and thus is generally when the service is performing actual work. The RunAsync method in a stateful service is executed only when the stateful service replica is primary. The RunAsync method is cancelled when a primary replica's role changes away from primary, as well as during the close and abort events.
P.p.s. Long-running operations are the devil when trying to write scalable systems. Try and tackle that now and save yourself the future pain if possible.
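As a sketch of the interruptible, checkpointed shape mentioned in the P.s. above (plain Java rather than the actual Service Fabric C# API; the CheckpointStore abstraction and method names are hypothetical):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Work is cut into small steps, progress is checkpointed after every step,
// and cancellation is honoured between steps, so another node can resume
// from the last checkpoint after a failover.
public class ResumableTask {

    interface CheckpointStore {                   // hypothetical persistence abstraction;
        int loadLastCompletedStep(String taskId); // in Service Fabric this could be a
        void save(String taskId, int step);       // reliable dictionary, elsewhere a DB table
    }

    private final CheckpointStore store;
    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    public ResumableTask(CheckpointStore store) {
        this.store = store;
    }

    public void cancel() {
        cancelled.set(true);                      // e.g. when the replica loses primary status
    }

    public void run(String taskId, int totalSteps) {
        int step = store.loadLastCompletedStep(taskId);   // resume where we left off
        while (step < totalSteps && !cancelled.get()) {
            doOneStep(taskId, step);                      // must be small and idempotent
            store.save(taskId, ++step);                   // checkpoint before moving on
        }
    }

    private void doOneStep(String taskId, int step) {
        // the actual unit of work goes here
    }
}
```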
To the first point - this was purely a client issue. Chrome saw my requests as identical and so delayed the 2nd request until the 1st got a response. Varying the parameters of the requests allowed them to be served concurrently.
I have a workflow that runs when an entity is created; it creates two other entities and puts them on a queue. It then waits until each entity's status reason is set to done, after which it continues.
Basically two teams will work an order and then it will continue processing after both teams are done.
Most of the time it works. However, sometimes it waits forever. I'll re-activate and re-resolve the other tasks, but it just never wakes up.
What can I do? The workflows aren't really powerful enough for me to have it poll with a timeout (there are no loops). I'd like to avoid on-change plugins for these other entities, which would scatter the workflow behavior all about.
Edit:
Restarting the CRM services (not sure which did it, I restarted them all) allowed the workflow to resume. However, I'd still like to know how to make this more reliable.
I had the same problem (and a lot more) with workflows in CRM 2011 and decided not to use them (except for very special purposes).
The main reason is their very limited error handling. Another reason is that it is inconvenient to put them under source control. Further reasons: workflows cannot run offline, and user impersonation is also not supported. For a comparison, look here: http://goo.gl/9ht1QJ
Use plugins instead of workflows, then you have full control.
But keep in mind that plugins (unlike workflows) are not designed for long running tasks.
So they have a default maximum execution time of 120 seconds and are not stateful/persisted. But in most cases (and I think also in your case) that is not a problem.
Just change your eventing a little bit:
Implement and register a plugin step for: entity is created - create the two other entities and put them on a queue.
Implement and register another step for: entity's status reason is set to done - query for the other entity and check its status; if both are done, continue processing.
If you really do not want to use plugins for your business logic, you can consider implementing a plugin which restarts/resumes faulted workflows.
But that's not a very nice solution.
I am considering using the Oracle Advanced Queueing technology for asynchronous communication. My aim is to use it for concurrent process execution (asynchronous PL/SQL procedure calls).
The current legacy implementation of the concurrent process execution consists of Unix KornShell (ksh) scripts which we start from the front end via an SSH connection in background mode. It works fine for us, but I am unhappy with that kind of solution because of:
Security (the front end starts an SSH connection and executes ksh scripts in background mode; I have heard from colleagues that this kind of login will be restricted in our company)
Maintenance (not everyone on our team is familiar with ksh scripts)
Diversity in technology (I try to decrease the diversity in technology because of know-how and migration effort)
Logging (our back-end system logs to database log tables, while the concurrent execution logs partially to a log file)
By moving from ksh to the database I will be able to increase the overall quality of my system:
Security (no more SSH connections; the front end will send messages to the database, and a database message listener will react to the messages and execute procedures asynchronously)
Maintenance (we use PL/SQL, which we are familiar with)
Diversity in technology (at the next OS migration we will only need to migrate the database objects and the data)
Logging (We will fully use our back end logging solution)
What do you think about my considerations, and what are your experiences with Oracle Advanced Queueing, especially regarding stability, performance and maintenance? Are there better alternatives?
I obviously don't know the details of your project, but if asynchronous PL/SQL procedure calls are your only goal, it may be easier to use DBMS_SCHEDULER. Your program could submit "run now" jobs through the scheduler that call your PL/SQL. In my opinion, the scheduler is a much easier thing to work with than AQ.
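For illustration, here is a hedged sketch of submitting such a "run now" job from Java over JDBC. The procedure name MY_PKG.PROCESS_ORDER, the connection URL and the credentials are placeholders; DBMS_SCHEDULER.CREATE_JOB and GENERATE_JOB_NAME are standard Oracle scheduler calls.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class RunNowJob {
    public static void main(String[] args) throws Exception {
        // Anonymous block that creates a one-off, immediately enabled job.
        String plsql =
            "BEGIN " +
            "  DBMS_SCHEDULER.CREATE_JOB( " +
            "    job_name   => DBMS_SCHEDULER.GENERATE_JOB_NAME('PROC_ORDER_'), " +
            "    job_type   => 'STORED_PROCEDURE', " +
            "    job_action => 'MY_PKG.PROCESS_ORDER', " +   // placeholder procedure
            "    enabled    => TRUE, " +                     // no schedule: the job starts right away
            "    auto_drop  => TRUE); " +                    // job metadata is removed after it completes
            "END;";

        // Connection details are placeholders.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1", "app_user", "app_password");
             CallableStatement stmt = conn.prepareCall(plsql)) {
            stmt.execute();   // returns as soon as the job is queued; the procedure runs asynchronously
        }
    }
}
```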
Managing flows with Oracle asynchronous queues brings with it advantages and disadvantages:
ADVANTAGES
Ability to manage flows by type, creating ad hoc code on which to create handlers (job events or apply processes) to manage the various sub-flows.
Easy to shut off a whole type of flow by closing dequeue on its queue.
Ability to manage message priorities (a parameter set when the queue is created): messages can be dequeued by insert time or by a priority attribute in the message payload.
Ability to manage messages with a deadline or an expiration (elapsed) time.
Aligns the paradigm with an event-driven solution rather than polling.
DISADVANTAGES
The load of the business logic will all be on the database.
When installing a new package you will need to stop the queues (enqueueing and dequeueing) and restart the handlers that point to that package.
You have to implement a recovery mechanism for incorrectly processed messages.
I think a good solution would be to use JMS (with Oracle AQ as the JMS provider) on the Oracle queues, so as to move the business logic into Java and use the strengths of the language, including its logging.
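As a rough sketch of that approach, here is a minimal JMS listener. The queue name and handler are illustrative, and obtaining the ConnectionFactory is Oracle-specific (typically oracle.jms.AQjmsFactory over a DataSource), so it is left out here.

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

public class OrderQueueListener {

    // factory is assumed to be configured for Oracle AQ elsewhere
    public static void listen(ConnectionFactory factory) throws Exception {
        Connection connection = factory.createConnection();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("APP_USER.ORDER_QUEUE");   // AQ queue, name is a placeholder
        MessageConsumer consumer = session.createConsumer(queue);

        consumer.setMessageListener(message -> handle(message));     // event-driven, no polling
        connection.start();
    }

    private static void handle(Message message) {
        try {
            if (message instanceof TextMessage) {
                String payload = ((TextMessage) message).getText();
                // business logic + the logging framework of choice go here
                System.out.println("Processing message: " + payload);
            }
        } catch (Exception e) {
            // a failed message could be moved to an error queue for reprocessing
            e.printStackTrace();
        }
    }
}
```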
Is there any way I can make my records in the database act automatically? E.g. I want a message to be sent to the helpdesk if a requested service is not attended to within 24 hours, without anyone clicking anything.
Technically it depends on the database you are using. If the database supports it, you could set up a scheduled job to scan the records, identify late services, and email the helpdesk.
If the database doesn't support scheduled tasks, then you could set up a client job on a timer to do the same thing.
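As an illustration of the "client job on a timer" variant, here is a small Java sketch; the table and column names, the JDBC URL and the notification step are all placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class OverdueRequestChecker {

    public static void main(String[] args) {
        // run the check immediately and then once every hour
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(OverdueRequestChecker::checkOverdueRequests, 0, 1, TimeUnit.HOURS);
    }

    static void checkOverdueRequests() {
        // placeholder schema: requests still OPEN more than 24 hours after creation
        String sql = "SELECT id, description FROM service_request " +
                     "WHERE status = 'OPEN' AND created_at < NOW() - INTERVAL '24 hours'";
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://dbhost/helpdesk", "app", "secret");
             PreparedStatement stmt = conn.prepareStatement(sql);
             ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                notifyHelpdesk(rs.getLong("id"), rs.getString("description"));
            }
        } catch (Exception e) {
            e.printStackTrace();   // a real job would log and alert on repeated failures
        }
    }

    private static void notifyHelpdesk(long requestId, String description) {
        // send an email or create a ticket here; left as a placeholder
        System.out.println("Request " + requestId + " is overdue: " + description);
    }
}
```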
This is what application software is for.
When the application saves to the database, the application also sends an email.
The traditional approach to this is to schedule a job (there are too many ways[1] to do that for me to go into details without knowing your server operating system, DBMS, and how much control you have to install or schedule programs on the server).
Your scheduled job would regularly check the database for records that have not been attended, and then take the appropriate action such as emailing the support team.
[1] Just so that this is not left completely unanswered: some DBMSs (e.g. SQL Server) have built-in job scheduling facilities. You could run a Windows service on the server to do this. If not, you might consider running a Windows service on one of your own servers to access the website (a great way to waste bandwidth).
Use a scheduler like this one, found on the rufus site. You could program it to run, for instance, every hour, and have it do the job without human interaction.
I am a Java shop myself and I've been using Quartz. It is quite good and usable if you can adjust to JRuby.
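For reference, a minimal sketch of scheduling an hourly job with the plain-Java Quartz API (the job class and identities are placeholders; the same idea applies when driving Quartz from JRuby):

```java
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SimpleScheduleBuilder;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class HourlyCheckJob implements Job {

    @Override
    public void execute(JobExecutionContext context) {
        // scan for overdue requests and email the helpdesk here
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(HourlyCheckJob.class)
                .withIdentity("overdueRequestCheck")
                .build();

        // fire immediately, then repeat every hour
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("hourlyTrigger")
                .startNow()
                .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                        .withIntervalInHours(1)
                        .repeatForever())
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}
```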
I've never liked database or operating system based solutions, since you might not control them and often get asked to run on different environments.
Here's a very simple background job handler for Ruby:
codeforpeople.rubyforge.org/svn/bj/trunk/README
Easy to install and use. Fairly lightweight. It uses a SQL backend for managing concurrency. Runs on multiple machines simultaneously if you need it to.