I'm currently making a watchdog to check if all bundles in a pipeline are still functioning properly. (This will be in a distributed environment so failure can be a network failure, software failure, one of the servers failing, ...)
Because a bundle can be bound to N amount of services, N arbitrary, the checking should will happen recursively using the following methodology:
START at the first step in the pipeline
Use getServicesInUse to get the services references of the next step
use getBundle() on the gathered ServiceRerefence objects
REPEAT until we arrive at the bundle we want to stop at
So that way I can get all the bundle objects of the pipeline (I assume) now to check if they are functioning correctly (or just if they are still reachable) I was wondering if
Bundle b = ...
if(b.getState() == Bundle.ACTIVE) ...;
will do the trick? Ofcourse also surrounding this with the necessary try catch clauses to detect hardware/network failure.
Can you clarify what you mean by "all bundles in a pipeline"?
You are right that a bundle can provide and consume zero or more services, but if I were to create a watchdog for an OSGi system I would use one of two approaches:
If the nodes in your distributed system provide mainly REST services, I would write a separate "watchdog" program that monitors these REST services to see if they still respond (on any of the nodes in my distributed system). You can either make "real" calls or just request some HEAD and see if you get a response.
If the nodes in your distributed system provide mainly OSGi services, I would write a watchdog bundle and deploy that to each node. I would then add a REST endpoint to my watchdog to allow me to monitor it remotely (by another watchdog, similar to approach #1).
Checking the active state of a bundle will tell you nothing. Bundles will remain active once started, but the services they provide could be unresponsive.
Related
I've just learned how to use notifications and subscriptions in Chef to carry out actions such as restarting services if a config file is changed.
I am still learning chef so may just have not got to this section yet but I'd like to know how to do the actions conditionally.
Eg1 if I change a config file for my stand alone apache server I only want to restart the service if we are outside core business hours ie the current local time is between 6pm and 6am. If we are in core business hours I want the restart to happen but at a later time, outside core hours.
Eg2 if I change a config file for my load balanced apache server cluster I only want restart the service if a) the load balancer service status is "running" and b) all other nodes in the cluster have their apache service status as running ie I'm not taking down more than one node in the cluster at once.
I imagine we might need to put the action in a ruby block that either loops until the conditions are met or sets a flag or creates a scheduled task to execute later but I have no idea what to look for to learn how best to do this.
I guess this topic is kind of philosophical. For me, Chef should not have a specific state or logic beyond the current node and run. If I want to restart at a specific time, I would create a cron job with a conditional and just set the conditional with chef (Something like debian's /var/run/reboot-required). Then crond would trigger the reboot.
For your second example, the LB should have no issues to deal with a restarting apache backend and failover to another backend. Given that Chef runs regulary with something called "splay" the probability is very low that no backend is reachable. Even with only 2 backends. That said, reloading may be the better way.
I'm implementing API which allows to launch other apps (using NSTask) inside VFS (FUSE on macOS). After VFS is mounted a bunch of processes start accessing launched VFS in which my app works, and I'd like to implement some kind of filtering mechnism which will allow to detect whether process which is accessing the VFS is created by system (and potentially safe) or not, and if so it'll be granted an access to the file system where my app runs.
So far I'm able to get basic information of the process by it's pid. For example: process path, uid, ppid, code signature of the process etc (using Security framework, libproc etc)
I've done a couple of tests and see that there are process with uid != 0 and still critical for my app to run (if I deny access to them app which is started in VFS crashes) (e.g. /usr/libexec/secinitd, /System/Library/CoreServices/Dock.app/Contents/MacOS/Dock), so looks like approach with filtering processes by pids, uids, ppids might not work.
So the question is: is it possible to distinguish whether process which is accessing my app was created by system and is potentially safe? I also don't want to do too much work by denying accees to critical system processes which will allow the app to successfully start and run in VFS.
Judging from the comment thread, your threat model is data theft via malware etc.
In this case, you can trust almost nothing, so the best way is probably to maintain an explicit whitelist of processes which are allowed to access your mount point, and block access to everything else by default. Log any processes to which access is denied, and allow the user to reverse that decision and add them to the whitelist. In other words, let the user decide what applications they consider safe.
Your said that according to your inspection, there were several processes which were mandatory for the process to run, so why won't use try-and-error approach.
You deploy you FUSE drive on clean environment and record all processes that attempt to access your files - try to prevent each process and keep only those which crash your apps, and add them to a white-list.
Of course that this list is subject to change in different macOS versions, but it can give you the general idea.
Alternatively, you can break your app into couple of parts. for example, put the sensitive logic inside separated dylib file, and prevent access to this file only.. since dylib is not the main executable in your app, I believe fewer processes require mandatory access it.
I'm trying to get to grips with service fabric and I'm struggling a little bit. Some questions:
are all service fabric service instances single-threaded? I created a stateless web api, one instance, with a method that did a Task.Delay, then returned a string. Two requests to this service were served one after the other, not concurrently. So am I right in thinking then that the number of concurrent requests that can be served is purely a function of the service instance count in the application manifest? Edit Thinking about this, it is probably to do with the set up of OWIN Wep Api. Could it be it is blocking by session? I assumed there is no session by default?
I have long-running operations that I need to perform in service fabric (that can take several hours). Is there a recommended pattern that I can use for this in service fabric? These are currently handled using a storage queue that triggers a webjob. Maybe something with Reliable Queues and a RunAsync loop?
It seems you handled the first part so I will comment on the second part: "long-running operations".
We can see long running operations / workflows being handled far before service fabric came about. For this reason, we can build on the shoulders of giants by looking on the design patterns that software experts have been using for decades. For example, the famous and all inclusive Process Manager. Mind you that this pattern is sometimes an overkill. If it is in your case, just check out the rest of the related patterns in the Enterprise Integration Patterns book (by Gregor Hohpe).
As for the use of reliable collections, those are implementation details when choosing a data structure supporting the chosen design pattern.
I hope that helps
With regards to your second point - It really depends on the nature of your long running task.
Is your long running task the kind of workload that runs on an isolated thread that depends on local OS/VM level resources and eventually comes back with a result (A)? or is it the kind of long running task that goes through stages and builds up a model of the result through a series of persisted state changes (B)?
From what I understand of Service Fabric, it isn't really designed for running long running workloads (A), but more for writing horizontally-scalable, highly-available systems.
If you were absolutely keen on using service fabric (and your kind of workload tends to be more like B than A) I would definitely find a way to break down those long running tasks that could be processed in parallel across the cluster. But even then, there is probably more appropriate technologies designed for this such as Azure Batch?
P.s. If you are going to put a long running process in the RunAsync method, you should design the workload so it is interruptable and its state can be persisted in a way that can be resumed from another node in the cluster
In a stateful service, only the primary replica has write access to
state and thus is generally when the service is performing actual
work. The RunAsync method in a stateful service is executed only when
the stateful service replica is primary. The RunAsync method is
cancelled when a primary replica's role changes away from primary, as
well as during the close and abort events.
P.s.s Long running operations are the devil when trying to write scalable systems. Try and tackle that now and save yourself the future pain if possibe.
To the first point - this is purely a client issue. Chrome saw my requests as indentical and so delayed the 2nd request until the 1st got a response. Varying the parameter of the requests allowed them to be served concurrently.
I am wondering what would be the best practice for deploying updates to a (MVC) Go web application. Imagine the following scenario :
1) Code and test some changes for my Go Web Application
2) Deploy update without anyone currently using the previous version getting interrupted.
I don't know how to make sure point 2) can be covered - when somebody is sending a request to the server and I rebuild/restart it just in this moment, he gets an error - even if the request just uses a part of the code I did not touch or that is backwards-compatible, or if I just added a new Request-handler.
Maybe I'm missing something trivial or a well-known pattern as I am just in the process of learning go and my previous web applications were ASP.NET- or php-applications where this was no issue as I did not need to restart the webserver on code changes.
It's not just an issue with Go, but in general we can divide the problem into two separate ones:
Making sure current requests do not get terminated and affect user experience.
Making sure there is no down-time in which new requests cannot be handled.
The first one is easier to tackle: You just don't violently kill your server, but tell it to exit, causing a "Drain phase", in which it does not accept new requests and only finishes the currently running requests, and exits. This can be done by listening on signals for example, and entering the app into a special state.
It's not trivial with Go as the default http server doesn't support shutting it down, but you can start a server with a net.Listener, and then keep a reference to it an close it when the time is due.
Now, doing only approach one and then starting the service again will cause new requests not to be accepted while this is going on, and we all know this can take a number of seconds in extreme cases.
So what we need is another instance of the server already running with the new code, the instant the old one is not responding to new requests, right? That can be done in several ways:
Having more than one server, and a load-balancer on top of them, allowing one (or more) server to take the load while we restart another. That's the simplest way, and the way most people do it. If you need N servers to take the load of your users, just keep N+1 and restart one at a time.
Using socket sharing tricks. In Newer Linux kernels, Many processes can listen and accept on the same port. What you do is simply start the new instance and then tell the old one to finish and exit. This way there is no pause. This is done by setting SO_REUSEPORT on the listening socket.
The above can be automated with ready to ship solutions, like Einhorn, that deals with all the details for you, see https://github.com/stripe/einhorn
Another approach is documented in this blog post: http://blog.nella.org/?p=879
I am trying to automate the startup of my Selenium Grid.
I have the Hub registered as a service, so that starts when the machine starts, but
the literature tells me I can't do the same with the node, because it won't be in a User context, and so I would not be able to get screenshots etc.
I've seen vague hints that you can add something to the registry to start a program, but I'm not really convinced thats what I want.
IT pulls down the servers for upgrades at intervals, and sessions are set to time out after X amount of inactivity, so its a tedious and silly process to open remote desktops to all 6 nodes, in order to log in, then start the process every time.
How do you best manage this?
- Configure the machines to auto-login, and place startSeleniumNode.bat in that users start-folder?
- Add some kind of commandline entry in the jenkins build script that launches the test, to call each of the 6 nodes in turn to start the selenium node (and how would you do that?)
Take a look at AlwaysUp - it allows you to run almost any application as a Windows service including Selenium Grid hubs and nodes.
I've previously created a fairly large Grid infrastructure using AlwaysUp for node management. It's very useful for starting up Grid on boot and lets you specify a user account to run as, schedule restarts at regular intervals and a lot more.