How to share large files between two microservices in Mesos? - microservices

I have a Mesos cluster and need to run two types of microservices: one produces very large files (possibly more than 2GB per file), and the other analyzes those files. The analysis service takes longer than the producer service.
After the analysis service is done, the file can be deleted.
I thought of two options:
NFS - the producer service creates all files on NFS and the analysis service reads them directly from the shared folder. (I'm concerned that this approach will consume all the internal bandwidth in my cluster.)
Local disk (my preferred option) - in this case I need to somehow force the analysis microservice to run on the same Mesos slave as the producer service that created the specific file. (I'm not sure this approach is possible.)
What would be the best practice in this case?

I guess this can be implemented in different ways, depending on your requirements:
If you want to be able to handle a host (agent) failure, I think there is no other way than using a shared filesystem such as NFS. Otherwise, if you use Marathon to schedule your microservices, a failed task will be restarted on another agent (where the data isn't locally available). With a shared filesystem, you would also need to make sure that the same mount points are available on each agent, and use these as host volumes in your containers. As a side note, the pod feature for co-locating tasks unfortunately only becomes available with Mesos 1.1.0 and Marathon 1.4 (not yet finally released).
If you don't care about host (agent) failures, you could possibly co-locate the two microservices on the same agent by using hostname constraints in Marathon, and mount host volumes that can then be shared across the services. I guess you'd need some orchestration to start the analysis service only after the producing service has finished.
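
As a rough sketch of the co-location idea, the snippet below builds two Marathon app definitions that share a hostname constraint and a host volume, and prints them as JSON. The agent hostname, directories, app ids, image names and resource figures are all made-up placeholders, not values from the question.

```python
import json

# Hypothetical values for illustration only; adjust to your cluster.
AGENT_HOSTNAME = "agent-1.example.com"  # assumed agent hostname
HOST_DIR = "/mnt/shared-files"          # assumed directory on the agent
CONTAINER_DIR = "/data"                 # assumed mount point in the container


def marathon_app(app_id, image):
    """Build a Marathon app definition pinned to one agent with a host volume."""
    return {
        "id": app_id,
        "instances": 1,
        "cpus": 1,
        "mem": 1024,
        # CLUSTER on the hostname field restricts the task to the named agent.
        "constraints": [["hostname", "CLUSTER", AGENT_HOSTNAME]],
        "container": {
            "type": "DOCKER",
            "docker": {"image": image},
            # Both apps mount the same host directory, so files written by
            # the producer are visible to the analyzer.
            "volumes": [
                {"containerPath": CONTAINER_DIR, "hostPath": HOST_DIR, "mode": "RW"}
            ],
        },
    }


producer = marathon_app("/files/producer", "example/producer:latest")
analyzer = marathon_app("/files/analyzer", "example/analyzer:latest")

print(json.dumps([producer, analyzer], indent=2))
```

Each definition could then be posted to Marathon's apps endpoint. The sketch does not cover the ordering problem (starting the analysis only once a file is complete), which still needs separate coordination.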

Related

OpenTelemetry for short-lived scripts?

Our system consists of many Python scripts that are run on "clean" machines, that is, they need to have as little additional software on them as possible. Is there a way we could use OpenTelemetry without having to run additional servers on those machines? Is there a push model for sending data instead of pull?
Considering your additional explanation, I imagine you will eventually want to collect all telemetry from these systems. Using OTLP exporters you can send all three signals (traces, metrics, and logs) to a collector service (as of now only tracing is stable; metrics and logs are experimental). You would not have to run any additional servers on these resource-constrained machines for your use case. There are two recommended deployment strategies for the OpenTelemetry Collector:
As an agent - Runs along with the application on same host machine.
As a gateway - Runs on standalone server outside the application host machine.
Running a collector agent on the same host as the application offloads some of the work from the language client libraries and enriches the telemetry, but can be resource intensive.
Read more about the collector here: https://opentelemetry.io/docs/collector/getting-started/
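
To illustrate the push model with a gateway collector, here is a minimal sketch that uses only the Python standard library to build one span in the OTLP/HTTP JSON encoding and POST it to a collector's /v1/traces endpoint. The collector hostname, service name and span name are assumptions. In practice the OpenTelemetry SDK with an OTLP exporter would normally do this for you; the sketch only shows that nothing needs to run on the script's machine besides the script itself.

```python
import json
import os
import time
import urllib.request

# Assumed gateway address; 4318 is the default OTLP/HTTP port.
COLLECTOR_URL = "http://collector.example.com:4318/v1/traces"


def build_span_payload(service_name, span_name, start_ns, end_ns):
    """Build an OTLP/HTTP JSON body containing a single span."""
    return {
        "resourceSpans": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service_name}}
                ]
            },
            "scopeSpans": [{
                "scope": {"name": "manual-script-instrumentation"},
                "spans": [{
                    "traceId": os.urandom(16).hex(),  # 16 random bytes, hex-encoded
                    "spanId": os.urandom(8).hex(),    # 8 random bytes, hex-encoded
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }]
    }


def push(payload, url=COLLECTOR_URL, timeout=5):
    """POST the payload to the collector and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


start = time.time_ns()
# ... the script's real work would happen here ...
payload = build_span_payload("nightly-script", "run", start, time.time_ns())
# push(payload)  # uncomment once a collector is reachable at COLLECTOR_URL
```

Because a short-lived script exits quickly, the push has to complete (or be flushed) before the process ends, which is why the sketch sends synchronously.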

NServiceBus: is this an illusion or a best practice?

Given an NServiceBus microservice that uses MSMQ, when I deploy a few instances of that service on the same machine, am I scaling out my application? Am I improving performance? Or is one instance enough? Should I instead get a more powerful machine to handle messages?
No, running multiple instances on a single machine will not make things run faster; it only makes execution less efficient.
However, it might be that a single instance isn't giving you the expected performance even though your system monitoring indicates there are plenty of unused resources. In that case you might want to tweak the configuration of your NServiceBus endpoint by configuring the number of messages allowed to be processed in parallel.
The following link shows how to increase the concurrency:
https://docs.particular.net/nservicebus/operations/tuning
You can further scale out by actually using multiple machines, but if all these endpoints share the same central database, your network or database server can easily become the bottleneck. If you consider deploying or scaling out your endpoints across multiple machines, make sure that any storage solutions are also scaled out so that they do not become your bottleneck.
Zero downtime upgrades/deployments
The only reason to have multiple instances on the same box is, for example, when deploying a new version: you can temporarily run the current and the new version side by side to achieve zero-downtime deployments.

File sync between n web servers in cluster

There are n nodes in a web cluster. Files may be uploaded to any node and then must be distributed to every other node. This distribution does not have to happen in a transaction (in fact it must not, distributed transactions don't scale) and some latency is acceptable, although it must be minimal. Conflicts can be resolved arbitrarily (typically last write wins) provided that the resolution is also distributed to all nodes so that eventually all nodes have the same set of files. Nodes can be added and removed dynamically without having to reconfigure existing nodes. There must be no single point of failure and no additional boxes required to solve this (such as RabbitMQ).
I am thinking along the lines of using consul.io for dynamic configuration, so that each node can refer to Consul to determine what other nodes are available, and writing a daemon (in Go) that monitors the relevant folders and communicates with other nodes using ZeroMQ.
Feels like I would be re-inventing the wheel though. This is a common problem and I expect there are solutions available already that I don't know about? Or perhaps my approach is wrong and there is another way to solve this?
Yes, there has been some stuff going on with distributed synchronization lately:
You could use Syncthing (open source) or BitTorrent Sync.
Syncthing is node-based, i.e. you add nodes to a cluster and choose which folders to synchronize.
BTSync is folder-based, i.e. you obtain a "secret" for a folder and can synchronize with everyone in the swarm for that folder.
From my experience, BTSync has better discovery and connectivity, but the whole synchronization process is closed source and nobody really knows what happens. Syncthing is written in Go, but sometimes has trouble discovering peers.
AFAIK, both Syncthing and BTSync use LAN broadcast and a tracker for discovery.
EDIT: Or, if you're really cool, use IPFS to host the latest version, IPNS to "name" that and mount the IPNS on the servers. You can set the IPFS bootstrap list to some of your servers, which would even make you independent of external trackers. :)
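
The "last write wins" rule mentioned in the question can be sketched in a few lines. This is only an illustration of the conflict-resolution idea, not how Syncthing, BTSync or IPFS resolve conflicts internally. The manifest format (path mapped to a modification time and content hash) is an assumption made for the example.

```python
def merge_last_write_wins(local, remote):
    """Merge two manifests mapping path -> (mtime, content_hash).

    The entry with the newer mtime wins. Ties are broken by comparing the
    hash, so every node picks the same winner regardless of merge order.
    """
    merged = dict(local)
    for path, entry in remote.items():
        # Tuples compare element by element: mtime first, then hash.
        if path not in merged or entry > merged[path]:
            merged[path] = entry
    return merged


# Two hypothetical nodes with diverging views of the same folder.
node_a = {"logo.png": (100, "aaa"), "index.html": (250, "bbb")}
node_b = {"logo.png": (180, "ccc"), "style.css": (90, "ddd")}

merged = merge_last_write_wins(node_a, node_b)
```

Since the merge gives the same result in either direction, nodes that exchange manifests converge on the same set of files. It does rely on reasonably synchronized clocks, and deletions would need explicit markers, which the sketch leaves out.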

Does the kubernetes scheduler support anti-affinity?

I'm looking at deploying Kubernetes on top of a CoreOS cluster, but I think I've run into a deal breaker of sorts.
If I'm using just CoreOS and fleet, I can specify within the unit files that I want certain services to not run on the same physical machine as other services (anti-affinity). This is sort of essential for high availability. But it doesn't look like kubernetes has this functionality yet.
In my specific use-case, I'm going to need to run a few clusters of elasticsearch machines that need to always be available. If, for any reason, kubernetes decides to schedule all of my elasticsearch node containers for a given ES cluster on a single machine, (or even the majority on a single machine), and that machine dies, then my elasticsearch cluster will die with it. That can't be allowed to happen.
It seems like there could be work-arounds. I could set up the resource requirements and machine specs such that only one elasticsearch instance could fit on each machine. Or I could probably use labels in some way to specify that certain elasticsearch containers should go on certain machines. I could also just provision way more machines than necessary, and way more ES nodes than necessary, and assume kubernetes will spread them out enough to be reasonably certain of high availability.
But all of that seems awkward. It's much more elegant from a resource-management standpoint to just specify required hardware and anti-affinity, and let the scheduler optimize from there.
So does Kubernetes support anti-affinity in some way I couldn't find? Or does anyone know if it will any time soon?
Or should I be thinking about this another way? Do I have to write my own scheduler?
Looks like there are a few ways that kubernetes decides how to spread containers, and these are in active development.
Firstly, of course there have to be the necessary resources on any machine for the scheduler to consider bringing up a pod there.
After that, kubernetes spreads pods by replication controller, attempting to keep the different instances created by a given replication controller on different nodes.
It seems a scheduling method that considers services and various other parameters was recently implemented: https://github.com/GoogleCloudPlatform/kubernetes/pull/2906 Though I'm not completely clear on exactly how to use it. Perhaps in coordination with this scheduler config? https://github.com/GoogleCloudPlatform/kubernetes/pull/4674
Probably the most interesting issue to me is that none of these scheduling priorities are considered during scale-down, only scale-up. https://github.com/GoogleCloudPlatform/kubernetes/issues/4301 That's a bit of a big deal: it seems like over time you could get weird distributions of pods, because they stay wherever they were originally placed.
Overall, I think the answer to my question at the moment is that this is an area of kubernetes that is in flux (as to be expected with pre-v1). However, it looks like much of what I need will be done automatically with sufficient nodes, and proper use of replication controllers and services.
Anti-affinity is when you don't want certain pods to run on certain nodes. For the present scenario I think taints and tolerations can come in handy. You can mark nodes with a taint; then only those pods that explicitly have a toleration for that taint are allowed to run on that specific node.
Below I describe how the anti-affinity concept can be implemented:
Taint any node you want.
kubectl taint nodes gke-cluster1-default-pool-4db7fabf-726z env=dev:NoSchedule
Here, env=dev is the key=value pair of the taint (a taint is separate from the node's labels),
and NoSchedule is the effect, which tells the scheduler not to place any pod on this node unless it has a matching toleration.
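Create a deployment (on recent kubectl versions, kubectl run creates a single pod rather than a deployment, so kubectl create deployment is used here)
kubectl create deployment newdep1 --image=nginx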
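Give the pods of this deployment a toleration that matches the taint (key env, value dev, effect NoSchedule) in the pod template spec, so that they are allowed onto this node while pods without the toleration are kept off it. Note that a label on the deployment, as in the next command, does not by itself create a toleration:
kubectl label deployments/newdep1 env=dev
kubectl scale deployments/newdep1 --replicas=5
kubectl get pods -o wide
With the toleration in place, the pods of newdep1 may be scheduled on your tainted node. A toleration only permits this and does not force it, so add a nodeSelector or node affinity if the pods must land on that node.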
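
For reference, a toleration is declared in the pod template of the deployment rather than through a label. The sketch below generates such a manifest as JSON. The app label and container name are assumptions chosen for the example; the toleration values mirror the env=dev:NoSchedule taint used above.

```python
import json

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "newdep1"},
    "spec": {
        "replicas": 5,
        # The selector must match the labels in the pod template.
        "selector": {"matchLabels": {"app": "newdep1"}},
        "template": {
            "metadata": {"labels": {"app": "newdep1"}},
            "spec": {
                "containers": [{"name": "nginx", "image": "nginx"}],
                # Lets these pods onto nodes tainted with env=dev:NoSchedule.
                "tolerations": [{
                    "key": "env",
                    "operator": "Equal",
                    "value": "dev",
                    "effect": "NoSchedule",
                }],
            },
        },
    },
}

print(json.dumps(deployment, indent=2))
```

The output can be piped to kubectl apply -f - since kubectl accepts JSON manifests as well as YAML.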

Is it safe to use etcd across multiple data centers?

Is it safe to use etcd across multiple data centers, given that it exposes the etcd port to the public internet?
Do I have to use client certificates in this case, or does etcd have some sort of authentication?
Yes, but there are two big issues you need to tackle:
Security. This all depends on what type of info you are storing in etcd. Using a point to point VPN is probably preferred over exposing the entire cluster to the internet. Client certificates can also be used.
Tuning. etcd relies on replication between machines for two things: liveness and consensus. Since a write must be committed to a majority of the cluster before it returns as successful, your write performance will degrade as the distance between the machines increases. Liveness is measured with periodic heartbeats between the machines. By default, etcd has a fairly aggressive heartbeat timeout (50ms at the time of writing), which is optimized for bare metal servers running on a local network. Without tuning this timeout value, your cluster will constantly think that members have disappeared and trigger frequent leader elections. This gets worse if both of your environments are on cloud providers that have variable networks plus disk writes that traverse the network, a double whammy.
More info on etcd tuning: https://etcd.io/docs/latest/tuning/
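
As a starting point for the tuning step, the helper below turns a measured round-trip time between data centers into values for etcd's --heartbeat-interval and --election-timeout flags (both in milliseconds). It follows the rule of thumb from the tuning guide: a heartbeat interval of roughly one RTT and an election timeout of at least ten times the RTT. The exact multipliers are a simplification, and the 120ms RTT in the example is made up.

```python
def suggest_etcd_timeouts(rtt_ms):
    """Return (heartbeat_interval_ms, election_timeout_ms) for a given RTT."""
    heartbeat = max(1, round(rtt_ms))  # about one round trip
    election = 10 * heartbeat          # at least ten round trips
    return heartbeat, election


def etcd_flags(rtt_ms):
    """Format the suggested values as etcd command-line flags."""
    heartbeat, election = suggest_etcd_timeouts(rtt_ms)
    return [
        f"--heartbeat-interval={heartbeat}",
        f"--election-timeout={election}",
    ]


# Example with an assumed 120 ms RTT between the two data centers.
flags = etcd_flags(120)
```

Measure the RTT between the actual members (for example with ping) and use the worst pair, since every member of the cluster should run with the same values.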
