OpenTelemetry: what is the difference between a collector and an agent?

I am trying to understand if there is any significant difference between the two.
While looking at the example, I noticed that it uses exactly the same binary and args for both (https://github.com/open-telemetry/opentelemetry-collector/blob/main/examples/demo/docker-compose.yaml). The only difference is in the config files, which differ in their exporters and receivers.
So is the only difference which endpoint is used to collect and send traces?

No. Although the binary is the same, the two differ in how they are deployed. The agent is a collector instance running on the same host as the application that emits the telemetry data. The agent then forwards this data to a gateway (one or more collector instances that receive data from multiple agents), and the gateway sends the data to the configured backends (Jaeger, Zipkin, private vendors, etc.).

Related

Scaling a microservice with frontend and backend instances

I am developing a series of microservices using Spring Boot and plan to deploy them on Kubernetes.
Some of the microservices are composed of an API which writes messages to a Kafka queue and a listener which listens to the queue and performs the relevant actions (e.g. writing to the DB, constructing messages for onward processing).
These services work fine locally but I am planning to run multiple instances of the microservice on Kubernetes. I'm thinking of the following options:
Run multiple instances as is (i.e. each microservice serves as an API and a listener).
Introduce a FRONTEND, BACKEND environment variable. If the FRONTEND variable is true, do not configure the listener process. If the BACKEND variable is true, configure the listener process.
This way I can scale the number of frontend and backend instances independently, and also have the benefit of shutting down the backend services without losing requests.
Any pointers, best practice or any other options would be much appreciated.
You can do as you describe, with environment variables, or you may also be interested in building your app with different profiles/bean configuration and make two different images.
In both cases, you should use two different Kubernetes Deployments so you can scale and configure them independently.
You may also be interested in the Leader Election pattern, which keeps only one replica active at a time. It is useful if it only makes sense for a single replica to process the events from a queue. Depending on your availability requirements, this can also be solved by simply running a single replica.
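As a rough illustration of the environment-variable approach from the question, the role selection could be isolated in one small helper. This is only a sketch: the class name ServiceRole and the rule that an instance with neither variable set runs both parts (the "as is" option) are assumptions made for the example, not anything provided by Spring Boot or Kubernetes.

```java
import java.util.Map;

/** Hypothetical helper: decides from environment variables which parts to start. */
final class ServiceRole {
    final boolean apiEnabled;
    final boolean listenerEnabled;

    private ServiceRole(boolean apiEnabled, boolean listenerEnabled) {
        this.apiEnabled = apiEnabled;
        this.listenerEnabled = listenerEnabled;
    }

    /**
     * FRONTEND=true enables the API, BACKEND=true enables the queue listener.
     * Assumption for this sketch: if neither is set, the instance runs both.
     */
    static ServiceRole fromEnv(Map<String, String> env) {
        // Boolean.parseBoolean(null) is false, so unset variables count as false.
        boolean frontend = Boolean.parseBoolean(env.get("FRONTEND"));
        boolean backend = Boolean.parseBoolean(env.get("BACKEND"));
        if (!frontend && !backend) {
            return new ServiceRole(true, true);
        }
        return new ServiceRole(frontend, backend);
    }
}
```

At startup the application would call ServiceRole.fromEnv(System.getenv()) and only wire up the listener when listenerEnabled is true. Each of the two Kubernetes Deployments then sets a different variable and can be scaled on its own.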

Why use Beats if I can post directly to Elasticsearch?

Recently I have been reading about the Elastic Stack and found out about Beats, which are basically lightweight shippers.
So the question is: if my service can post directly to Elasticsearch, do I actually need Beats? From what I understand, it's just a kind of proxy (?)
Hopefully my question is clear enough.
Not sure which Beat you are specifically referring to, but let's take Filebeat as an example.
Suppose application logs need to be indexed into Elasticsearch. The options are:
1. Post the logs directly to Elasticsearch
2. Save the logs to a file, then use Filebeat to index the logs
3. Publish the logs to a message broker such as RabbitMQ or Kafka, then use Logstash input plugins to read from RabbitMQ or Kafka and index into Elasticsearch
Option 2 Benefits
Filebeat ensures that each log message is delivered at least once. Filebeat is able to achieve this behavior because it stores the delivery state of each event in the registry file. In situations where the defined output is blocked and has not confirmed all events, Filebeat will keep trying to send events until the output acknowledges that it has received them.
Before shipping data to Elasticsearch, we can do some additional processing or filtering. For example, we may want to drop some logs based on text in the log message, or add an additional field (e.g. add the application name to all logs, so that we can index logs from multiple applications into a single index and then filter by application name on the consumption side).
Essentially, Beats provide a reliable way of indexing data without causing much overhead to the system, as they are lightweight shippers.
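The retry behaviour described above can be illustrated with a toy shipper. This is not Filebeat's code: the class AtLeastOnceShipper, the in-memory offset standing in for the registry file, and the Predicate standing in for the output are all made up for the example.

```java
import java.util.List;
import java.util.function.Predicate;

/** Illustrative at-least-once shipper; a sketch, not how Filebeat is implemented. */
final class AtLeastOnceShipper {
    // Stands in for the registry: how many events the output has acknowledged.
    private int ackedOffset = 0;

    int ackedOffset() {
        return ackedOffset;
    }

    /**
     * Sends every event that has not been acknowledged yet.
     * The output returns true when it acknowledges an event.
     * On the first missing acknowledgement the method returns without
     * advancing the offset, so the next call retries from the same event.
     */
    void ship(List<String> events, Predicate<String> output) {
        while (ackedOffset < events.size()) {
            if (!output.test(events.get(ackedOffset))) {
                return; // output blocked: keep the offset and retry later
            }
            ackedOffset++; // record progress only after the acknowledgement
        }
    }
}
```

Because progress is recorded only after an acknowledgement, an event whose acknowledgement is lost would be sent again, which is why the guarantee is "at least once" rather than "exactly once".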
Option 3 - This provides the same benefits as option 2. It can be more useful if we want to ship the logs directly to an external system instead of storing them in a file on the local system, for example for applications deployed in Docker/Kubernetes, where we may not have much access to, or enough space on, the local file system.
Beats are good as lightweight agents for collecting streaming data like log files, OS metrics, etc., where you need some sort of agent to collect and send. If you have a service that wants to put things into Elasticsearch, then yes, by all means it can just use the REST or Java APIs directly.
Filebeat offers a way to centralize live logs from multiple servers.
Let's say you are running multiple instances of an application on different servers and they are all writing logs.
You can ship all these logs to a single Elasticsearch index and analyze or visualize them from there.
A single static file doesn't need Filebeat to be moved into Elasticsearch.

Use Hazelcast Executor Service to be executed on clients

In all the documentation and all the "Google search results" I have seen, the Hazelcast executor service is used to execute tasks on "Members".
I wonder if it is possible to also have things being executed on hazelcast clients?
The distributed executor service is intended to run processing where the data is hosted, on the servers. This is a similar idea to a stored procedure, run the processing where the data lives, save data transfer.
In general, you can't run a Java Runnable or Callable on the clients as the clients may not be Java.
Also, the clients don't host any data, so they'd have to fetch what data they need from the servers potentially.
If you want something to run on all or some connected clients, you could implement this yourself using the publish/subscribe mechanism. A payload could be sent to an ITopic with the necessary execution parameters, and clients listening can act on the message.
You can also create a Near Cache on the client side and use the JDK's ExecutorService, which runs tasks inside your local JVM application.
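A minimal sketch of that last suggestion, using only the JDK. The Map parameter stands in for data already available on the client (for example through a Near Cache); the class and method names are hypothetical, and no Hazelcast API is used here.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical example: run a task in the client's own JVM with a JDK executor. */
final class LocalClientExecution {
    /** Sums the lengths of all values, computed on a local thread pool. */
    static int sumValueLengths(Map<String, String> localView) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // The task runs in this JVM, not on a cluster member.
            Callable<Integer> task = () ->
                    localView.values().stream().mapToInt(String::length).sum();
            Future<Integer> result = pool.submit(task);
            return result.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the task", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Unlike the distributed executor, this does nothing to move the computation closer to the data: the client still has to have, or fetch, whatever entries the task reads.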

Multiple Logstash instances vs Filebeats

I'm trying to establish the best architecture for our elastic stack implementation.
We have two distinct networks (let's call them internal and external) and several web / db / application servers (approx. 10) on each of these networks.
I would like to consume IIS logs, our RabbitMQ messages and some other bits and bobs from machines in both networks and send them to a single server on the internal network where my Elasticsearch and Kibana installations are located.
For the servers on both the internal and external networks I can see two main ways to get the logs sent to elastic.
Set up Logstash on each server and send the output to the Elasticsearch server on the internal network.
Set up Filebeat on each server and send the logs to a single server running Logstash (this could be the same box that hosts Elasticsearch and Kibana).
I'm unsure of the pros and cons of these approaches at the moment. I believe the correct approach is to use Filebeat, but I don't see why I wouldn't just put Logstash in multiple places, as it seems I would be better off distributing the processing of logs.
Then again, perhaps having one logstash with 20-30 inputs isn't a problem?
Interested in any thoughts or guidance in this area.
From what I read in the documentation, Logstash is much more demanding in terms of memory than Filebeat, especially if you do some kind of processing on the logs (like grok parsing). Each Logstash instance is at least a JVM (running JRuby). I assume Filebeat's footprint is much smaller, since it's optimized for shipping logs (I have never used it, so I can't say for sure).
Running Logstash on every server also complicates any update you would want to make to the Logstash instances or their configurations.
For a centralized Logstash, the advantage is that it is easy to change the address of the Elasticsearch instance, redirect to a cache like Redis, or add another output. I also found that Logstash (in version 2.+) required frequent restarts, which is easier if you only have one instance to deal with.
I have never used Logstash with multiple inputs, so I can't say.
In the job where I was responsible for a log centralisation system, we used Beaver (a Filebeat equivalent) to ship the logs to a Redis server, and we had two or three Logstash servers sending everything to Elasticsearch. All of the comments above come from that period.

Vert.x cluster Eventbus cross processes

Does anybody have some info, links, or pointers on how cross-process EventBus communication works? From the documentation I conclude that multiple Vert.x instances (thus separate JVM processes) can be clustered and communicate via the EventBus. However, there is little to no documentation on how to achieve it.
Looking at the docs, I can see that the publish/registerHandler methods take the address as a String, which works within a process, but I cannot wrap my head around how it works across processes and how to register and publish to an address. Does it work over HTTP or TCP? From an API perspective, do I need to pass a port and process signature?
Cross-process communication happens via the EventBus. Multiple Vert.x instances can be started up and clustered to allow separate instances on the same or other machines to communicate. The low-level clustering is handled by Hazelcast. The configuration is handled by the cluster.xml file in the conf folder of your Vert.x install. You can learn more about the format of the file by looking at the Hazelcast docs. It is transparent to your handlers and works over TCP.
You can test it by running two or more instances on your local machine once they are started with the -cluster flag. Look at the example being run, and the config changes required in How to use eventbus messaging in vertx?
