Kafka vs Filebeat for shipping logs to Logstash - Elasticsearch

I am currently setting up a central logging system (using ELK) which is expected to receive log data from around 100 microservices and could expand further. The requirement is minimum latency and a highly available solution.
Right now I am stuck on what the design should look like.
While researching online, I found the approach below to be widely used for such requirements:
Microservice -> filebeat -> kafka -> logstash -> ElasticSearch -> Kibana
However, I am struggling to understand whether Filebeat is really useful in this case.
What if I stream logs directly to Kafka, which then ships them to Logstash? That would spare me from maintaining log files, and there would be one less component to monitor and maintain.
I see an advantage of Kafka over Filebeat in that it can act as a buffer when the volume of data being shipped is very high or when the ES cluster is unreachable. Source: https://www.elastic.co/blog/just-enough-kafka-for-the-elastic-stack-part1
I want to understand whether there is any real benefit to having Filebeat that I am failing to see.

Filebeat can be installed on each of your servers or nodes. It collects and ships logs quickly; it is very fast and lightweight, and written in Go.
In your case, the advantage is that you don't have to spend time developing the same log collection and shipping functionality yourself. You just configure Filebeat for your logging architecture, which is very convenient.
Another description of Filebeat is available at the link.
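For the Kafka pipeline in your question, Filebeat's role is simply to tail the service log files and publish each line to a Kafka topic. A minimal sketch of such a config might look like this (Filebeat 7.x-style syntax; the paths, broker addresses and topic name are made up):

    # filebeat.yml - minimal sketch; paths, brokers and topic are hypothetical
    filebeat.inputs:
      - type: log
        paths:
          - /var/log/myservice/*.log

    output.kafka:
      hosts: ["kafka1:9092", "kafka2:9092"]
      topic: "service-logs"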

Related

What is the use of FileBeat while parsing logs in Elasticsearch

I am not understanding why Filebeat is required when we already have Logstash.
With Filebeat you are able to collect and forward log files from one or many remote servers.
There is also an option to add source-specific fields to your log entries.
You have several output options, such as Elasticsearch or Logstash, for further analysis/filtering/modification.
Just imagine 20 or 200 machines running services like databases, web servers, hosted applications and containers, and now you need to collect all the logs...
With Logstash alone you'll be pretty limited in this scenario.
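As an illustration, a per-machine Filebeat config that tags every entry with source-specific fields and forwards to Logstash could look roughly like this (the paths, field values and Logstash host are hypothetical):

    # filebeat.yml on each machine - sketch only
    filebeat.inputs:
      - type: log
        paths:
          - /var/log/nginx/*.log
        fields:
          service: webserver
          datacenter: dc1

    output.logstash:
      hosts: ["logstash.internal:5044"]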
Beats are light-weight agents used primarily for forwarding events from multiple sources. Beats have a small footprint and use fewer system resources than Logstash.
Logstash has a larger footprint, but provides a broad array of input, filter, and output plugins for collecting, enriching, and transforming data from a variety of sources.
Please note, though, that Filebeat is also capable of handling parsing for most use cases by using Ingest Node, as described here.
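For example, an Ingest Node pipeline with a grok processor could be defined roughly as below (the pipeline name and pattern are made up); Filebeat's Elasticsearch output can then reference it via its pipeline setting:

    PUT _ingest/pipeline/app-logs
    {
      "description": "hypothetical pipeline that parses app log lines",
      "processors": [
        {
          "grok": {
            "field": "message",
            "patterns": ["%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}"]
          }
        }
      ]
    }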

Proper crawler architecture - how to use the ElasticSearch ecosystem?

In v1.0 of a .NET data crawler, I created a Windows service that read URLs from a database and, based on some logic, selected what to crawl at a specified interval.
This was single-threaded and worked well for a low number of endpoints, but scaling is obviously an issue.
I'm trying to find out how to do this using the ElasticSearch (ELK) stack and came across HTTPBeat,
a Beat to poll HTTP endpoints in a regular interval and ship the
result to the configured output channel, e.g. Logstash and
Elasticsearch.
Looking at the documentation, you have to add URLs to the config.yaml file. That's not what I'm looking for, as the list of URLs could change and we may not want all URLs crawled at the same time.
Then there's RSS for Logstash, which is a command-line tool - again, not what I'm looking for.
Is there a way to make use of the Beats daemon to read from the ElasticSearch database to do work based on database values - crawls, etc?
To take this to the enterprise level, do Beats or any other components of the ElasticSearch ecosystem use message queuing or a spooler (like Filebeat does - is this built into Beats?)?

Centralized ELK vs Centralized EK + Multiple Logstash

We want to set up a common logging interface across all the product teams in our company. We chose ELK for this and I want some advice regarding the setup:
One way is to have a centralized ELK setup, with every team using some sort of log forwarder, e.g. Filebeat, to send logs to the common Logstash. The issue I see with this is: if teams want to apply filters to analyze their log messages, they would need access to the common ELK machine to add those filters, as Beats doesn't support grokking or any other filtering.
The second way is to have separate Logstash servers per team, all pointing to a common Elasticsearch cluster. This way teams are free to modify/add grok filters.
Please enlighten me if I am missing something or my understanding is wrong. Other ideas are welcome.
Have you considered using fluentd instead? It is lightweight, similar to Filebeat, and allows grokking and parsing.
Of course, your other alternative is to use a centralized Logstash instance and have different configuration files for each entity.
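A rough sketch of what such a per-team Logstash pipeline could look like, assuming Filebeat adds a team field to each event (the field value, grok pattern, hosts and index name are all hypothetical):

    # /etc/logstash/conf.d/team-a.conf - sketch only
    input {
      beats {
        port => 5044
      }
    }

    filter {
      if [fields][team] == "team-a" {
        grok {
          match => { "message" => "%{COMBINEDAPACHELOG}" }
        }
      }
    }

    output {
      elasticsearch {
        hosts => ["http://elasticsearch.internal:9200"]
        index => "team-a-%{+YYYY.MM.dd}"
      }
    }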

Architecture for syncing logs to Hadoop

I have different environments across a few cloud providers, e.g. Windows servers and Linux servers in Rackspace, AWS, etc., and there is a firewall between these and the internal network.
I need to build a real-time environment where all newly generated IIS and Apache logs are synced to an internal big data environment.
I know there are tools like Splunk or Sumo Logic that might help, but we are required to implement this logic with open-source technologies. Because of the firewall, I am assuming I can only pull the logs rather than have the cloud providers push them.
Can anyone share the rule of thumb or common architecture for syncing large amounts of logs in NRT (near real time)? I have heard of Apache Flume and Kafka and am wondering whether they are required or whether it is just a matter of using something like rsync.
You can use rsync to get the logs, but you can't analyze them the way Spark Streaming or Apache Storm can.
You can go ahead with one of these two options.
Apache Spark Streaming + Kafka
OR
Apache Storm + Kafka
Have a look at this article about the integration approaches for these two options.
Have a look at this presentation, which covers an in-depth analysis of Spark Streaming and Apache Storm.
Performance depends on your use case. Spark Streaming is said to be 40x faster than Storm processing. But if you add "reliability" as a key criterion, then data should be moved into HDFS first before processing by Spark Streaming, and that will reduce the final throughput.
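A minimal sketch of the Spark Streaming + Kafka option, using the old DStream API (Spark 1.x/2.x) to read log lines from Kafka and land them on HDFS; the broker address, topic names and HDFS path are made up:

    # Spark Streaming + Kafka sketch; brokers, topics and the HDFS path are hypothetical
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="log-sync")
    ssc = StreamingContext(sc, 10)  # 10-second micro-batches

    stream = KafkaUtils.createDirectStream(
        ssc,
        ["iis-logs", "apache-logs"],
        {"metadata.broker.list": "kafka.internal:9092"})

    # each record is a (key, value) pair; keep the raw log line and write it to HDFS
    stream.map(lambda kv: kv[1]) \
          .saveAsTextFiles("hdfs:///data/logs/raw")

    ssc.start()
    ssc.awaitTermination()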
Reliability Limitations: Apache Storm
Exactly once processing requires a durable data source.
At least once processing requires a reliable data source.
An unreliable data source can be wrapped to provide additional guarantees.
With durable and reliable sources, Storm will not drop data.
Common pattern: Back unreliable data sources with Apache Kafka (minor latency hit traded for 100% durability).
Reliability Limitations: Spark Streaming
Fault tolerance and reliability guarantees require HDFS-backed data source.
Moving data to HDFS prior to stream processing introduces additional latency.
Network data sources (Kafka, etc.) are vulnerable to data loss in the event of a worker node failure.

logstash to receive log from android? or is this elasticsearch?

I'm still a bit confused after reading the documentation provided by Logstash. I'm planning to write an Android app, and I want to log the app's activity. Logs will be sent over the network. Is Logstash not the right solution, since it needs an "agent" installed on the systems that produce logs?
I want a system that can store logs of the app activity, but it also needs to be able to export the collected logs to a plain-text file. I know Logstash can output to Elasticsearch, but I'm not sure whether it can export to a plain-text file at the same time, or whether that is a task Elasticsearch should do.
Thanks a ton for any input you can provide.
Logstash forwarder isn't currently available for Android/iOS, unfortunately, nor could I find any existing solution for it from the community. (I asked the same question here but was voted off-topic because it was deemed to be asking for tool/library suggestions.)
Your best bet, unfortunately, is either to write one yourself (which isn't trivial: you'll need to factor in offline connectivity, batching, scheduling, compression, file tracking, and so on), or to use other (usually commercial) logging services such as LogEntries.
By the way, the Android/iOS clients for LogEntries are open source. I'm not clear on the OSS licensing, but if you're going to write an agent for Logstash yourself, you could perhaps start by looking at LogEntries' Android agent implementation, which already solves the technical problems mentioned above: https://github.com/logentries/le_android.
And to answer your other question: yes, Logstash should receive your logs (from the mobile device), usually via the lumberjack input (aka logstash-forwarder). Logstash can then persist and index these logs into Elasticsearch, provided it's configured that way.
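For illustration, a Logstash config with a lumberjack input and both an Elasticsearch output and a plain-text file output might look roughly like this (certificate paths, host and file path are hypothetical; the lumberjack input requires SSL):

    # logstash config sketch - paths and hosts are made up
    input {
      lumberjack {
        port            => 5043
        ssl_certificate => "/etc/pki/tls/certs/logstash.crt"
        ssl_key         => "/etc/pki/tls/private/logstash.key"
      }
    }

    output {
      elasticsearch {
        hosts => ["localhost:9200"]
      }
      # plain-text copy of every event, covering the "export to file" requirement
      file {
        path => "/var/log/app/app-%{+YYYY-MM-dd}.log"
      }
    }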
