How to monitor Apache Mesos? - mesos

I run a Mesos production instance.
It is important to keep this system up and running.
Is it possible to set up some sort of monitoring around a TCP/HTTP endpoint?

http://<fqdn>:5050/master/health
That endpoint returns HTTP status code 200 for GET requests when the master is healthy.
One could use Zabbix or an alternative monitoring tool to poll it and monitor a Mesos master.
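If you would rather script the check yourself than set up Zabbix, a minimal sketch in Python could look like the following. It only assumes that a healthy master answers the health endpoint with 200; the function name, the timeout and the injectable `opener` parameter are my own choices for illustration, not part of Mesos.

```python
import urllib.error
import urllib.request


def master_is_healthy(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if a GET on the health endpoint answers with HTTP 200.

    `opener` is only a hook so the check can be exercised without a
    network; by default the standard library's urlopen is used.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an error status
        # (urlopen raises HTTPError, a URLError subclass, for 4xx/5xx).
        return False
```

Calling `master_is_healthy("http://<fqdn>:5050/master/health")` from a cron job or from your monitoring agent, and alerting when it returns False, would be one way to wire it up.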

Related

Concepts to write code to monitor running application on the server

Requirement:-
I have to write code that monitors all the running applications on the server and outputs the name of any application that is down.
Research:-
During my research I found that:-
There are several tools, like azure and monito, that monitor all the applications themselves, but they do not match our requirements.
We can write code that checks all the running services on the local desktop or the server. From there we can check the status of the required applications, and if the status is stopped or sleeping we can send a notification.
We can send requests to the deployed URL at a regular interval, and if we get a response status other than 200 we can notify the user that something is wrong and this particular website is not working.
If anyone can throw some light on this and suggest more methods or references from their experience, it will be highly appreciated.
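The URL polling idea from the research list can be sketched roughly like this. It is an illustration only: the function names, the `apps` mapping of application name to URL, and the `notify` callback are hypothetical, and a real setup would replace `print` with mail, chat or pager notifications.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return the HTTP status of a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except OSError:
        return None  # refused, timed out, DNS failure, ...


def find_down(apps, probe=probe):
    """Return the names of all applications whose status is not 200."""
    return [name for name, url in apps.items() if probe(url) != 200]


def monitor(apps, interval_seconds=60.0, notify=print, rounds=None, probe=probe):
    """Poll every application at a fixed interval and report the ones down.

    `rounds=None` polls forever; a number limits the polling cycles.
    """
    completed = 0
    while rounds is None or completed < rounds:
        for name in find_down(apps, probe):
            notify(f"{name} is down")
        completed += 1
        if rounds is None or completed < rounds:
            time.sleep(interval_seconds)
```

With a mapping such as `{"shop": "https://shop.example.com/health"}` (a made-up URL), `monitor(apps)` would print the name of every application that does not answer with 200 on each cycle.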

OpenTelemetry for short-lived scripts?

Our system consists of many Python scripts that are run on "clean" machines, that is, they need to have as little additional software on them as possible. Is there a way we could use OpenTelemetry without having to run additional servers on those machines? Is there a push model for sending data instead of pull?
Considering your additional explanation, I imagine you will eventually want to collect all telemetry from these systems. Using OTLP exporters, you can push all three signals (traces, metrics, logs) to a collector service; as of now only tracing is stable, and metrics and logs are experimental. For your use case you would not have to run any additional servers on these resource-constrained machines. There are two recommended deployment strategies for the OpenTelemetry Collector:
As an agent - Runs along with the application on same host machine.
As a gateway - Runs on standalone server outside the application host machine.
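Running the collector as an agent on the application host offloads some of the work from the language client libraries and enriches the telemetry, but it can be resource intensive. For "clean" machines the gateway deployment is the better fit, because the scripts export directly to a collector running elsewhere.
Read more about the collector here: https://opentelemetry.io/docs/collector/getting-started/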

testing performance on Linux dockers

I have several scripts testing the performance of a Linux server. There are about 12 Docker containers running on it.
We are also interested in collecting metrics for the containers (right now we only collect metrics for the Linux machine itself).
Is there a plugin for this? Or can this be done with the PerfMon plugin?
There are several ways of monitoring Docker instance statistics:
Built-in Runtime Metrics
cAdvisor - probably the easiest one to set up and use
Any of the built-in Linux monitoring tools
You can also use the JMeter PerfMon Plugin as usual. That way the performance monitoring results are integrated into your test script and you can correlate JMeter metrics with server health metrics. Just make sure there is TCP/UDP connectivity between JMeter and the PerfMon Server Agent; the default port is 4444, so make sure the container exposes this port to the outside world.
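For the built-in runtime metrics, one option is to take a sample with the Docker CLI and parse it yourself. The sketch below assumes the `docker` binary is on the PATH and that `docker stats --no-stream --format "{{json .}}"` prints one JSON object per container with `Name`, `CPUPerc` and `MemPerc` fields; the function names and the returned dictionary layout are my own.

```python
import json
import subprocess


def parse_stats(output):
    """Convert JSON-lines output of `docker stats` into a metrics dict.

    Returns {container name: {"cpu_percent": float, "mem_percent": float}}.
    """
    metrics = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        metrics[record["Name"]] = {
            # Values arrive as strings such as "1.50%".
            "cpu_percent": float(record["CPUPerc"].rstrip("%")),
            "mem_percent": float(record["MemPerc"].rstrip("%")),
        }
    return metrics


def sample_containers():
    """Take one sample of all running containers via the Docker CLI."""
    result = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{json .}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_stats(result.stdout)
```

Calling `sample_containers()` at the same interval as your host metrics collection would give you per-container numbers to store next to the JMeter results.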

How to share large files between two microservices in Mesos?

I have a Mesos cluster and I need to run two types of microservices: one produces very large files (possibly more than 2GB per file) and the other analyzes those files. The analyzing microservice takes more time than the producer service.
After the analysis service is done - the file can be deleted.
I thought of two options:
NFS - the producer service creates all files on NFS and the analysis service reads them directly from the shared folder. (I'm concerned that this approach will consume all the internal bandwidth in my cluster.)
Local disk (my preferred option) - in this case I need to somehow force the analysis microservice to run on the same Mesos slave as the producer service that created the specific file. (I'm not sure this approach is possible.)
What would be best practice in this case?
I guess this can be implemented in different ways, depending on your requirements:
If you want to be able to handle a host (agent) failure, I think there is no other way than using a shared filesystem such as NFS. Without one, if you use Marathon to schedule your microservices, a failed task will be restarted on another agent, where the data isn't locally available. With a shared filesystem you would also need to make sure that the same mount points are available on each agent, and use these as host volumes in your containers. As a side note, the pod feature for co-locating tasks unfortunately only becomes available with Mesos 1.1.0 and Marathon 1.4 (not yet finally released).
If you don't care about host (agent) failures, you could co-locate the two microservices on the same agent by using hostname constraints in Marathon, and mount host volumes that can then be shared across the services. I guess you'd need some orchestration to start the analysis service only after the producing service has finished.
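To make the hostname constraint idea a bit more concrete, here is a rough sketch that builds two Marathon app definitions pinned to the same agent and sharing one host directory. The app ids, image names, agent hostname, paths and resource numbers are all made up for illustration; only the `constraints` entry with the `CLUSTER` operator and the `volumes` layout follow the Marathon app definition format as I understand it.

```python
import json


def pinned_app(app_id, image, agent_hostname, host_path, container_path="/data"):
    """Build a Marathon app definition pinned to one agent with a host volume."""
    return {
        "id": app_id,
        "cpus": 1,
        "mem": 1024,
        "instances": 1,
        # Only offers from this agent's hostname are accepted.
        "constraints": [["hostname", "CLUSTER", agent_hostname]],
        "container": {
            "type": "DOCKER",
            "docker": {"image": image},
            # The same host directory is mounted into both services.
            "volumes": [
                {"containerPath": container_path, "hostPath": host_path, "mode": "RW"}
            ],
        },
    }


# Hypothetical names: both apps land on the same agent and see the same files.
producer = pinned_app("/files/producer", "example/producer", "agent-1.example.com", "/mnt/exchange")
analyzer = pinned_app("/files/analyzer", "example/analyzer", "agent-1.example.com", "/mnt/exchange")
payloads = [json.dumps(app, indent=2) for app in (producer, analyzer)]
```

Each entry in `payloads` could then be posted to Marathon's `/v2/apps` endpoint. The ordering problem (start the analysis only once a file is complete) is not solved by this and still needs its own coordination.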

AWS - Load Balanced Instances & Cron Jobs

I have a Laravel application where the Application servers are behind a Load Balancer. On these Application servers, I have cron jobs running, some of which should only be run once (or run on one instance).
I did some research and found that people seem to favor a lock-system, where you keep all the cron jobs active on each application box, and when one goes to process a job, you create some sort of lock so the others know not to process the same job.
I was wondering if anyone had more details on this procedure in regards to AWS, or if there's a better solution for this problem?
You can build distributed locking mechanisms on AWS using DynamoDB with strongly consistent reads. You can also do something similar using Redis (ElastiCache).
Alternatively, you could use Lambda scheduled events to send a request to your load balancer on a cron schedule. Since only one back-end server would receive each request, that server could execute the cron job.
These solutions tend to break when your autoscaling group experiences a scale-in event and the server processing the task gets deleted. I prefer to have a small server, like a t2.nano, that isn't part of the cluster and schedule cron jobs on that.
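To show what the lock approach boils down to, here is a small sketch of a lease-style lock with an expiry. The `LeaseLock` class is an in-memory stand-in I made up for illustration and is NOT shared between servers; in a real setup the check-and-set inside `acquire` would have to be a single atomic operation in the shared store (for example a conditional write in DynamoDB or a set-if-absent with expiry in Redis). The expiry is what lets another server take over if the lock holder is terminated during a scale-in event.

```python
import time


class LeaseLock:
    """In-memory stand-in for a shared lock table (illustration only)."""

    def __init__(self):
        self._leases = {}  # job name -> (owner, expiry timestamp)

    def acquire(self, job, owner, ttl_seconds, now=None):
        """Take the lease if it is free or expired; return True on success."""
        now = time.time() if now is None else now
        holder = self._leases.get(job)
        if holder is not None and holder[1] > now:
            return False  # someone still holds an unexpired lease
        self._leases[job] = (owner, now + ttl_seconds)
        return True


def run_once(lock, job, owner, task, ttl_seconds=300, now=None):
    """Run `task` only if this server wins the lease for `job`."""
    if not lock.acquire(job, owner, ttl_seconds, now=now):
        return False
    task()
    return True
```

Every application server would keep the same cron entry and call something like `run_once(lock, "nightly-report", hostname, send_report)`; only the server that wins the lease does the work. The job name and `send_report` are hypothetical.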
Check out this package for Laravel implementation of the lock system (DB implementation):
https://packagist.org/packages/jdavidbakr/multi-server-event
Also, this pull request solves this problem using the lock system (cache implementation):
https://github.com/laravel/framework/pull/10965
If you need to run something only once globally (so not once on every server) and 'lock' the thing that needs to be run, I highly recommend using AWS SQS because it offers exactly that: each machine runs a cron that tries to fetch a ticket (a message). If it gets one, it processes it; otherwise it does nothing. So all crons are active on all machines, but once a machine receives a ticket, that ticket is 'in flight' and cannot be received by another machine until its visibility timeout expires.
