I am working on an Elasticsearch indexing task.
Currently we are indexing hundreds of thousands of documents to ES cluster (many ES instances) on daily basis. We simply read data from DB and various of data sources, collate and compile them and directly index them to ES using python elasticsearch-dsl lib.
Currently;
Now we want use Logstash between the application server and ES however we don't want to change our current codebase. Thus we could easily switch between ES to Logstash and keep business logic in our application.
My question is how can I use logstash as a transparent middleware. I simply want logstash to forward all rest messages as they are to the ES cluster.
Want to;
Related
I've K8S cluster up and running. There is Elastic search and Kibana deployed on the K8S cluster.
I need to populate ES with almost 25 t0 50GB of random data to Elastic search for testing. Any easy way to achieve this. I'm a newbie to ES and K8S. Any inputs or pointers will be of great help.
You can use Logstash for ingesting data to the elasticsearch. Logstash supports various input plugins from elasticsearch log4j to S3. You can try ingesting data from any one of the sources that logstash supports as input plugin.
[https://www.elastic.co/guide/en/logstash/current/input-plugins.html][1]
I am trying out the ELK to visualise my log file. I have tried different setups:
Logstash file input plugin https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html
Logstash Beats input plugin https://www.elastic.co/guide/en/logstash/current/plugins-inputs-beats.html with Filebeat Logstash output https://www.elastic.co/guide/en/beats/filebeat/current/logstash-output.html
Filebeat Elasticsearch output https://www.elastic.co/guide/en/beats/filebeat/current/elasticsearch-output.html
Can someone list out their differences and when to use which setup? If it is not for here, please point me to the right place like Super User or DevOp or Server Fault.
1) To use logstash file input you need a logstash instance running on the machine from where you want to collect the logs, if the logs are on the same machine that you are already running logstash this is not a problem, but if the logs are on remote machines, a logstash instance is not always recommended because it needs more resources than filebeat.
2 and 3) For collecting logs on remote machines filebeat is recommended since it needs less resources than a logstash instance, you would use the logstash output if you want to parse your logs, add or remove fields or make some enrichment on your data, if you don't need to do anything like that you can use the elasticsearch output and send the data directly to elasticsearch.
This is the main difference, if your logs are on the same machine that you are running logstash, you can use the file input, if you need to collect logs from remote machines, you can use filebeat and send it to logstash if you want to make transformations on your data, or send directly to elasticsearch if you don't need to make transformations on your data.
Another advantage of using filebeat, even on the logstash machine, is that if your logstash instance is down, you won't lose any logs, filebeat will resend the events, using the file input you can lose events in some cases.
An additional point for large scale application is that if you have a lot of Beat (FileBeat, HeartBeat, MetricBeat...) instances, you would not want them altogether open connection and sending data directly to Elasticsearch instance at the same time.
Having too many concurrent indexing connections may result in a high bulk queue, bad responsiveness and timeouts. And for that reason in most cases, the common setup is to have Logstash placed between Beat instances and Elasticsearch to control the indexing.
And for larger scale system, the common setup is having a buffering message queue (Apache Kafka, Rabbit MQ or Redis) between Beats and Logstash for resilency to avoid congestion on Logstash during event spikes.
Figures are captured from Logz.io. They also have a good
article on this topic.
Not really familiar with (2).
But,
Logstash(1) is usually a good choice to take a content play around with it using input/output filters, match it to your analyzers, then send it to Elasticsearch.
Ex.
You point the Logstash to your MySql which takes a row modify the data (maybe do some math on it, then Concat some and cut out some words then send it to ElasticSearch as processed data).
As for Logbeat(2), it's a perfect choice to pick up an already processed data and pass it to elasticsearch.
Logstash (as the name clearly states) is mostly good for log files and stuff like that. usually you can do tiny changes to those.
Ex. I have some log files in my servers (incl errors, syslogs, process logs..)
Logstash listens to those files, automatically picks up new lines added to it and sends those to Elasticsearch.
Then you can filter some things in elasticsearch and find what's important to you.
p.s: logstash has a really good way of load balancing too many data to ES.
You can now use filebeat to send logs to elasticsearch directly or logstash (without a logstash agent, but still need a logstash server of course).
Main advantage is that logstash will allow you to custom parse each line of the logs...whereas filebeat alone will simply send the log and there is not much separation of fields.
Elasticsearch will still index and store the data.
Hi as a newbie to elastic I have a doubt on why we need fileBeat to ship logs to ElasticSearch(ES) or Logstatsh.
As far as I knew we can directly read logs from files and send to logstash and from there to ES. If the former is allowed why we need FileBeat to be a intermediary layer between logs and logstash.
What i knew : xyzlogfile--->logstash-file--->ES--->kibana
Why do we need FileBeat between : xyzlogfile--->fileBeat--->logstash-file--->ES--->kibana
I assume you are talking about File Input Plugin vs Filebeat.
Some points to note:
Logstash is much heavier in terms of memory and CPU usage than Filebeat. It requires a JVM which might be fine if you deploy java software but for many projects a JVM is an unnecessary overhead. Filebeat is just a light native executable.
You might not need Logstash at all
If your logs are JSON
If you don't need any parsing and you are ok with timestamps generated by Filebeat ([EDIT 2021-01-01] Filebeat has various processors, it can even do arbitrary script execution, pure go implementation of ECMASCRIPT 5.1, https://www.elastic.co/guide/en/beats/filebeat/current/processor-script.html)
If you have simple regex parsing (e.g. grok filter) you can just use Ingest Nodes (https://www.elastic.co/guide/en/elasticsearch/reference/5.0/ingest.html)
For more complex parsing/event cloning/grouping Logstash will probably be needed. Just writing a ruby filter for example is super easy and you can prototype fast. For optimizing super high production loads you might need to write a custom filter plugin, or perhaps you can try writing your own custom Processor to be used with Ingest Nodes (but I haven't tried that yet, I can tell you that writing a custom Logstash filter is pretty straightforward)
All the above points are related to ingesting file contents, but Logstash has many input/output plugins that you might need and are only available with Logstash
If all your files are located on the same node as the logstash process, than using the File Input Plugin could be an option ("xyzlogfile--->logstash-file--->ES--->kibana").
However for most deployments you want to collect data from many nodes with different roles and software stacks deployed on them. You do not want to deploy a Logstash instance on all those nodes, so "xyzlogfile--->fileBeat--->logstash-beats--->ES--->kibana" should be used (or another option is "xyzlogfile--->fileBeat--->ES--->kibana" with Ingest Node).
Based On Mastering Elastic Stack by Packt:
Beats are data shippers shipping data from a variety of inputs such as files, data streams, or logs whereas Logstash is a data parser. Though Logstash can ship data, it's not its primary usage.
Logstash consumes a lot of memory and requires a higher amount of resources , whereas Beats requires fewer resources and consumes low memory.
I have a couchbase cluster setup as the primary source for data. From this a subset of data is synced to a elasticsearch cluster via the Couchbase Transport Plugin for ElasticSearch(https://github.com/couchbaselabs/elasticsearch-transport-couchbase) which sets up an XDCR stream from couchbase to elasticsearch.
Due to some issues with the elasticsearch cluster all data needs to be synced again from couchbase to elasticsearch. I have tried recreating XDCR but that does not seem to help as it only copies a very small subset of documents. Is there a way by which this can be achieved?
Additional details
Couchbase version: 3.1.0
Number of couchbase documents: 50K+
Documents synced to elasticsearch: around 700 (expected 20K+)
If a document in couchbase is modified it is successfully synced to elasticsearch
The issue you're experiencing is likely in one of the following: XDCR, the Couchbase Transport Plugin for Elasticsearch, or Elasticsearch itself.
Start by checking for XDCR errors. You can find your XDCR logs using these instructions. Be aware that the Transport Plugin uses XDCR v1 and almost everything else in Couchbase uses v2.
Consult the advice in troubleshooting the Couchbase Transport Plugin for Elasticsearch. Instructions should work for you even though they are from the 4.0 docs.
Pay attention to how your documents are being mapped to Elasticsearch. You mention that you're expecting only a subset of documents to be synced to Elasticsearch, so it's possible that you have lost a setting or misconfigured something. You can enable logging and observe a small set of test data. At TRACE level, you should be able to see each document that is inspected.
If all of that fails, make sure the basics are working by indexing the beer sample dataset, following the directions in the Couchbase docs. ES is probably not the issue, but test with a fresh ES instance will rule out problems on that side.
Would it be safe to share an elasticsearch cluster (or single-node elasticsearch cluster) between Logstash or graylog2 and my own application? what configuration changes/additions should be made for accomodating that? what kind of name-spacing would the application require for storing its own data in separation from graylog/Logstash?
I'd rather avoid maintaining separate clusters, especially on dev boxes but also in general - if the architecture allows.
It is technically possible but not recommended. You will experience load on the logging cluster that you want to decouple from the other applications using ES.
Graylog2 supports defining an index prefix for having multiple setups running in one ES cluster.
We have both(Kibana and Graylog) running with shared Elasticsearch. It's just that the indexing pattern is something different we have to add Circuit breakers in Elasticsearch so that Kibana search query for logs would not expand beyond a certain size.