Why use Beats if I can post directly to Elasticsearch?

Recently I have been reading up on the Elastic Stack and came across this thing called Beats, which are basically lightweight shippers.
So the question is: if my service can post directly to Elasticsearch, do I actually need Beats at all? From what I understand, they are just a kind of proxy (?)
Hopefully my question is clear enough.

Not sure which Beat you are specifically referring to, but let's take Filebeat as an example.
Suppose application logs need to be indexed into Elasticsearch. The options are:
1. Post the logs directly to Elasticsearch.
2. Save the logs to a file, then use Filebeat to index the logs.
3. Publish the logs to a broker such as RabbitMQ or Kafka, then use Logstash input plugins to read from the broker and index into Elasticsearch.
Benefits of option 2:
Filebeat ensures that each log message is delivered at least once. It can do this because it stores the delivery state of each event in its registry file. If the configured output is blocked and has not acknowledged all events, Filebeat keeps trying to send them until the output confirms it has received them.
Before shipping data to Elasticsearch, we can do some additional processing or filtering, for example dropping some logs based on text in the log message, or adding an extra field (e.g. an application name on every log, so that multiple applications' logs can go into a single index and be filtered by application name on the consumption side).
Essentially, Beats provide a reliable way of indexing data without adding much overhead to the system, since they are lightweight shippers.
Option 3 provides the same benefits as option 2. It can be more useful if we want to ship the logs directly to an external system instead of storing them in a file on the local machine, for example for applications deployed in Docker/Kubernetes where we do not have much access or enough space to store files locally.

Beats are good as lightweight agents for collecting streaming data like log files, OS metrics, etc., where you need some sort of agent to collect and send. If you have a service that wants to put things into Elasticsearch, then yes, by all means it can just use the REST/Java/etc. APIs directly.
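To make the direct route concrete, here is a minimal sketch using the official Python client; the cluster URL, index name, field values, and application name are placeholders, and this is just one of several clients a service could use:

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch  # pip install elasticsearch

# Cluster URL is a placeholder; adjust for your deployment.
es = Elasticsearch("http://localhost:9200")

# One log event. An "app" field lets several applications share a single
# index and still be filtered apart later, as described in the answer above.
doc = {
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "app": "billing-service",   # hypothetical application name
    "level": "ERROR",
    "message": "payment gateway timed out",
}

# Client 8.x keyword; older 7.x clients use body=doc instead of document=doc.
es.index(index="app-logs", document=doc)
```

Whether this is enough depends on how much you care about retries, buffering, and back-pressure, which is exactly what Filebeat's registry gives you for free.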

Filebeat offers a way to centralize live logs from multiple servers.
Let's say you are running multiple instances of an application on different servers and they are all writing logs.
You can ship all these logs to a single Elasticsearch index and analyze or visualize them from there.
A single static file doesn't need Filebeat to get it into Elasticsearch.

Related

How to get errors captured from nifi logs specific to my application when multiple nifi applications are running

We have multiple teams' NiFi applications running on the same NiFi machine... Is there any way to capture the logs specific to my application? Also, by default the nifi-app.log file makes it difficult to track issues, and the bulletin board shows error messages for only 5 minutes... How can the errors be captured and a mail alert sent from NiFi?
Please help me get through this. Thanks in advance!
There are a couple ways to approach this. One is to route failure relationships from processors to a PutEmail processor which can send an alert on errors. Another is to use a custom reporting task to alert a monitoring service when a certain number of flowfiles are in an error queue.
Finally, we have heard that in multitenant environments, log parsing is difficult. While NiFi aims to reduce or completely eliminate the need to visually inspect logs by providing the data provenance feature, in the event you do need to inspect the logs, we recommend searching the log by processor ID to isolate relevant messages. You can also use NiFi itself to ingest those same logs and perform parsing and filtering activities if desired. Future versions may improve this experience.
By parsing the NiFi log, you can separate out the logs specific to your team's applications, using the processor group id and the NiFi REST API. Check the link below for a NiFi template and Python code that solve this issue:
https://link.medium.com/L6IY1wTimV
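As a rough, self-contained illustration of that log-parsing idea (not the template from the link above), the following Python sketch greps nifi-app.log for ERROR lines mentioning a given process group UUID and mails any matches; the log path, UUID, and SMTP settings are all placeholders:

```python
import smtplib
from email.message import EmailMessage

# All of these values are placeholders for illustration.
NIFI_LOG = "/opt/nifi/logs/nifi-app.log"
GROUP_ID = "01234567-89ab-cdef-0123-456789abcdef"  # your process group / processor UUID
SMTP_HOST = "smtp.example.com"
ALERT_TO = "team@example.com"

def collect_errors(log_path: str, needle: str) -> list[str]:
    """Return ERROR lines that mention the given component id."""
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        return [line.rstrip() for line in fh
                if "ERROR" in line and needle in line]

errors = collect_errors(NIFI_LOG, GROUP_ID)
if errors:
    msg = EmailMessage()
    msg["Subject"] = f"{len(errors)} NiFi errors for group {GROUP_ID}"
    msg["From"] = "nifi-alerts@example.com"
    msg["To"] = ALERT_TO
    msg.set_content("\n".join(errors[-50:]))  # last 50 matching lines
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)
```

Run from cron, something like this sidesteps the 5-minute bulletin board limit, though the REST API approach in the linked article is more robust across log rotations.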
You can send all the errors in a process group to the same processor; it could be a regular UpdateAttribute or a custom processor. That processor adds the path and all the relevant information, then passes the flowfile to a general error/logging flow, which inspects the error information inside the flowfile and decides whether to send an email, to whom, and so on.
Using this approach, the system stays simple and inside NiFi, so you don't add more layers of complexity, and you have only one processor managing errors per process group.
This is the way we are managing errors in my company.

Multiple Logstash instances vs Filebeats

I'm trying to establish the best architecture for our Elastic Stack implementation.
We have two distinct networks (let's call them internal and external) and several web/db/application servers (approximately 10) on each of them.
I would like to consume IIS logs, our RabbitMQ messages, and some other bits and bobs from machines on both networks and send them to a single server on the internal network where my Elasticsearch and Kibana installation is located.
For the servers on both the internal and external networks, I can see two main ways to get the logs to Elasticsearch:
1. Set up Logstash on each server and send the output to the Elasticsearch server on the internal network.
2. Set up Filebeat on each server and send the logs to a single server running Logstash (this could be the same box that hosts Elasticsearch and Kibana).
I'm unsure of the pros and cons of these approaches at the moment. I believe the correct approach is to use Filebeat, but I don't see why I wouldn't just put Logstash in multiple places, since it seems like that would better distribute the processing of the logs.
Then again, perhaps having one Logstash instance with 20-30 inputs isn't a problem?
Interested in any thoughts or guidance in this area.
From what I read in the documentation, Logstash is much more demanding in terms of memory than Filebeat, especially if you do some kind of processing on the logs (like grok parsing). Logstash means running at least one JVM (with JRuby) per server. Filebeat's footprint should be much smaller, since it's optimized for shipping logs (I have never used it, so I can't say for sure).
Running Logstash everywhere also complicates any update you want to make to the Logstash instances or their configurations.
With a centralized Logstash, the advantage is that it is easy to change the address of the Elasticsearch instance, redirect to a cache like Redis, or add another output. I also found that Logstash (in version 2.x) required frequent restarts, which is easier to deal with if you only have one instance.
I have never used Logstash with multiple inputs, so I can't say.
In a previous job where I was responsible for a log centralisation system, we used Beaver (a Filebeat equivalent) to ship the logs to a Redis server, and we had two or three Logstash servers sending everything to Elasticsearch. All of the comments above come from that period.

Log inactivity monitoring in ELK stack

I am configuring an ELK stack server with Filebeat, which monitors log files and sends them to Logstash. Is it possible to configure an alerting mechanism at either the Filebeat or Logstash level so that we get an alert when the logs being monitored are no longer being written to?
Filebeat and Logstash are event-oriented, so they can't tell you when data is not being shipped, since nothing is being triggered. For this you would probably need to purchase Elastic's Watcher alerting mechanism or use a service like Logz.io, which also offers alerting.
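If a commercial alerting option isn't available, one homegrown workaround is to check on the Elasticsearch side instead: a small script run from cron that counts documents indexed in the last few minutes and raises an alert when the count is zero. A rough Python sketch, where the index pattern, time window, and client version are assumptions:

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")   # placeholder URL

INDEX = "filebeat-*"    # index pattern the logs land in (assumption)
WINDOW = "5m"           # how long silence is acceptable

# Count documents whose @timestamp falls inside the window
# (client 8.x keyword; older clients take body={"query": ...}).
resp = es.count(
    index=INDEX,
    query={"range": {"@timestamp": {"gte": f"now-{WINDOW}"}}},
)

if resp["count"] == 0:
    # Hook in your own notification here: email, Slack, a Nagios passive check, ...
    print(f"ALERT: nothing indexed into {INDEX} in the last {WINDOW}")
```

This detects silence anywhere in the pipeline (application, Filebeat, Logstash, or the network), which is usually what you actually care about.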

Fluentd vs Kafka

The use case is this:
I have several Java applications running, each of which has to interact with different (application-specific) Elasticsearch indices. For instance, application A uses Elasticsearch indices A, B, and C to query and update, while application B uses indices A, C, and D (say).
Some common interface is required that can manage all these data streams. Currently I'm evaluating Kafka and Fluentd for this purpose.
Can someone explain which would be better suited to this situation? I've looked at the features of both Kafka and Fluentd and I don't really understand what difference the choice would make here.
Thanks a lot.
Kafka provides publish/subscribe messaging as a distributed commit log. Usually you install Kafka on each host where you need to produce data to be forwarded somewhere else, and all those hosts together form a cluster. The good thing here is that if for some reason network connectivity becomes unstable or goes down, your application can continue to produce data/logs and they won't be lost, whereas if your application sends logs directly to a remote centralized logging host, you might lose some logs while the network is down.
Fluentd is a centralized log collector, commonly installed on one host (or more if you need horizontal scaling). It connects to remote data sources, applies filtering, and sends unified log data to remote data sinks.
From the Fluentd docs, you can see that Fluentd can both consume data from Kafka and produce data to Kafka. That alone should hint that Fluentd and Kafka are on different layers, since the former uses the latter.
It would actually be more logical to compare Fluentd and Logstash. As far as Fluentd is concerned, Kafka is just another data source and/or data sink, but the two are different beasts altogether.
If you want the best of both worlds, use Kafka as the input/output data pipe from/to your apps, and Fluentd (or Logstash) as your centralized logging system reading from those Kafka topics.
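As a concrete illustration of that pattern on the application side, here is a minimal Python sketch using the kafka-python library; the broker address and topic name are placeholders, and the Fluentd/Logstash pipeline would be configured separately to consume the topic and write to the appropriate Elasticsearch index:

```python
import json
from datetime import datetime, timezone

from kafka import KafkaProducer  # pip install kafka-python

# Broker address and topic name are assumptions for the example.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Application A publishes its events to a topic; the Fluentd/Logstash
# pipeline decides which Elasticsearch index each topic ends up in.
event = {
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "app": "application-A",
    "message": "order 42 updated",
}
producer.send("logs.application-a", value=event)
producer.flush()
```

The application never talks to Elasticsearch directly, so index routing, filtering, and retries live in one place (the collector) rather than in every service.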
If you want to read more on the topic, you can read about how Fluentd and Kafka complement each other very well rather than competing with each other.
From: The Life Blood Of Your Data Pipeline
Kafka is primarily related to holding log data rather than moving log data. Thus, Kafka producers need to write the code to put data in Kafka, and Kafka consumers need to write the code to pull data out of Kafka.
Fluentd has both input and output plugins for Kafka so that data engineers can write less code to get data in and out of Kafka. We have many users that use Fluentd as a Kafka producer and/or consumer.

Monitoring health of logstash

I am going to be using Logstash to send a high volume of events to a broker. I have monitoring on the broker to check its health status, but I can't find much information on how to see whether the Logstash process is healthy, or whether there are indicators of a failing process.
For those who use Logstash, what are some ways you monitor it?
You can have a cron job inject a heartbeat message and route such messages to some kind of monitoring system. If you already use Elasticsearch, you could use it for this as well and write a script to ensure that you have reasonably recent heartbeat messages from all hosts that should be sending them, but I'd prefer using something like Nagios or lovebeat-go.
This could be used to monitor the health of a single Logstash instance (i.e. you inject the heartbeat message into the same instance that feeds the monitoring software), but you could just as well use it to check the overall health of the whole pipeline.
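For the do-it-yourself Elasticsearch variant, a rough sketch of such a freshness check is below; the index name, field names, host list, and age threshold are all assumptions:

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")       # placeholder URL
EXPECTED_HOSTS = {"web-01", "web-02", "db-01"}    # hosts that should be heartbeating
MAX_AGE_SECONDS = 300

# Latest heartbeat per host: terms aggregation with a max sub-aggregation.
resp = es.search(
    index="heartbeats",                           # assumed index name
    size=0,
    query={"match_all": {}},
    aggs={
        "per_host": {
            "terms": {"field": "host.keyword", "size": 1000},
            "aggs": {"last_seen": {"max": {"field": "@timestamp"}}},
        }
    },
)

now = datetime.now(timezone.utc).timestamp()
seen = {}
for bucket in resp["aggregations"]["per_host"]["buckets"]:
    seen[bucket["key"]] = bucket["last_seen"]["value"] / 1000.0  # epoch millis

for host in EXPECTED_HOSTS:
    last = seen.get(host)
    if last is None or now - last > MAX_AGE_SECONDS:
        print(f"ALERT: no recent heartbeat from {host}")
```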
Update: This got built into Logstash in 2015. See the announcement of the Logstash heartbeat plugin.
If you're trying to monitor Logstash as a shipper, it's easy to write a script that compares the contents of the .sincedb* file to the actual file on disk to make sure they're in sync.
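For example, something along these lines, assuming the classic sincedb layout of inode, major device, minor device, and byte offset per line (newer versions of the file input append a timestamp and path as extra columns); the paths here are placeholders:

```python
import glob
import os

LOG_FILE = "/var/log/myapp/app.log"                  # file Logstash is shipping
SINCEDB_GLOB = os.path.expanduser("~/.sincedb*")     # sincedb location varies by setup

log_stat = os.stat(LOG_FILE)

for sincedb in glob.glob(SINCEDB_GLOB):
    with open(sincedb) as fh:
        for line in fh:
            fields = line.split()
            # Match the sincedb entry for our file by inode.
            if not fields or int(fields[0]) != log_stat.st_ino:
                continue
            shipped = int(fields[3])                  # byte offset Logstash has read to
            lag = log_stat.st_size - shipped          # bytes not yet shipped
            print(f"{LOG_FILE}: {lag} bytes behind ({shipped}/{log_stat.st_size})")
```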
As an indexer, I'd probably skip ahead and query Elasticsearch for the number of documents being inserted.
@magnus' idea for a latency check is also good. I've used the log's timestamp and compared it to Elasticsearch's timestamp to compute the latency.
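A sketch of that latency idea: since Elasticsearch does not record an ingest time by default, this assumes the pipeline stamps an indexed_at field on each event (e.g. via a Logstash filter); index and field names are placeholders:

```python
from datetime import datetime

from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder URL

def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp, tolerating a trailing 'Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

# Grab a handful of the most recent documents (client 8.x keyword arguments).
resp = es.search(
    index="filebeat-*",                     # assumed index pattern
    size=20,
    sort=[{"@timestamp": "desc"}],
    query={"match_all": {}},
)

for hit in resp["hits"]["hits"]:
    src = hit["_source"]
    # "indexed_at" is assumed to be added by the pipeline, not by Elasticsearch.
    lag = parse_ts(src["indexed_at"]) - parse_ts(src["@timestamp"])
    print(f"{hit['_id']}: latency {lag.total_seconds():.1f}s")
```

If the latency starts climbing, Logstash is alive but falling behind; if documents stop arriving at all, the inactivity-style check above is the better signal.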
