Elasticsearch + Logstash - Indexing data from different sources when receiving an event - elasticsearch

Good day,
I have run into a bit of a headache while working on indexing some data in Elasticsearch and have some questions about a good approach.
As of now, an event is received on a Kafka topic with just a part of the data that should be stored in the document. The rest of the data needs to be collected after the event is received and is available from different APIs. To reduce the amount of work, it seems that Logstash could be a good approach.
Is there a way to configure Logstash to initiate data collection from different APIs and DBs when an event is received, and then assemble the document from the combined data, or am I stuck with writing time-consuming custom logic for the indexing? I have searched around a bit, but couldn't find any good answer to the problem.

What you need in Logstash is to look up/enrich your message with data from external APIs, right?
You could use Logstash's http filter plugin.
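A rough sketch of what that could look like, assuming a JSON event arriving on a Kafka topic and a hypothetical REST endpoint (the topic name, URL, and field names below are made up for illustration, not taken from the question); a jdbc_streaming filter does the same kind of per-event lookup against a database:

```
input {
  kafka {
    bootstrap_servers => "localhost:9092"
    topics => ["events"]                # hypothetical topic name
    codec => "json"
  }
}

filter {
  # Call an external API for every event and merge the response into the document.
  http {
    url => "https://example.com/api/customers/%{[customer_id]}"   # placeholder endpoint
    verb => "GET"
    target_body => "[customer]"
  }

  # A per-event database lookup would follow the same idea with jdbc_streaming:
  # jdbc_streaming {
  #   jdbc_connection_string => "jdbc:postgresql://db:5432/app"
  #   jdbc_user => "logstash"
  #   jdbc_driver_class => "org.postgresql.Driver"
  #   statement => "SELECT * FROM orders WHERE id = :id"
  #   parameters => { "id" => "order_id" }
  #   target => "[order]"
  # }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "events-%{+YYYY.MM.dd}"
  }
}
```

Both filters run once per event, so slow or rate-limited APIs will throttle the pipeline; caching repeated lookups (e.g. with the memcached or elasticsearch filter) is a common mitigation.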

Related

Running awk in logstash

I do not have the ability to do much other than receive unstructured syslogs from Kafka, which have been produced with Logstash.
When I attach Logstash as a consumer, these syslogs are all over the place and contain half a dozen patterns or more, which vary wildly. This feels like something better suited to being streamed through an awk filter, since the programmatic approach to parsing incoming messages is actually quite straightforward with such a tool.
Does anyone have any input on how one could attach a consumer to a Kafka topic, process the incoming logs, and ship them in an intelligent way towards an Elasticsearch cluster?
Try to use grok expressions in your Logstash config to parse the logs: https://logz.io/blog/logstash-grok/ . This should allow you to filter, transform, or drop data.
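For example, a minimal sketch of such a filter block (the patterns and field names are illustrative only, since the actual syslog variants aren't shown here):

```
filter {
  grok {
    # Try each pattern in turn; replace these with patterns matching your real syslog variants.
    match => {
      "message" => [
        "%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:host} %{DATA:program}: %{GREEDYDATA:msg}",
        "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}"
      ]
    }
    tag_on_failure => ["_grokparsefailure"]
  }

  # Optionally drop events that match none of the patterns.
  if "_grokparsefailure" in [tags] {
    drop { }
  }
}
```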
Or use something like Cribl in between Kafka and Elasticsearch: https://docs.cribl.io/stream/about/
Note on the Cribl page how Kafka is one of the supported sources and Elasticsearch is one of the supported destinations. This should allow you to transform your data before ingesting it into Elasticsearch.

Design: Generic Elasticsearch index storing all Kafka topic messages

Hi, I am new to the Elastic stack. This is basically a design question. We have a lot of Kafka topics (>500), and each of them stores JSON as the data exchange format. Now we are planning to build a Kafka consumer and dump all the records/JSON documents into a single index. We have some requirements, but the most important one to begin with is being able to search through all the relevant JSON documents based on a few important field values. For example, if multiple JSON documents have a field correlation id with the value XYZ, then entering XYZ should search across all the topics.
Also, as an additional question: since we are using Kibana, do we have some inbuilt visualization for this search so that we don't need to build our own UI? This is simply for management to search for specific values and it doesn't need to be a very fancy UI.
What would be the best thing to do; is having a single index the best design? What things do we need to consider? I read about the standard analyzer and am wondering if that is enough for our purpose.
Assumption: all Kafka topics will store JSON, and each JSON document can have a different format. Some might have lots of nesting or nested objects, some might be simple.
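For illustration, assuming the consumer writes the topics into indices covered by one index pattern and the field ends up mapped as correlation_id (both of these names are assumptions, not from the question), the cross-topic lookup described above is a single query:

```
GET kafka-*/_search
{
  "query": {
    "match": {
      "correlation_id": "XYZ"
    }
  }
}
```

In Kibana, the same lookup is available without custom UI work by pointing Discover at that index pattern and searching for correlation_id : "XYZ".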

How to approach metrics and alerting for produced and consumed messages in Grafana

I have a problem with creating metrics and later triggering alerts based on those metrics. I have two data sources, both Elasticsearch. One contains documents (logs from a service) saying that a message was produced to Kafka; the second contains documents (also logs from a service) saying that a message was consumed. What I want to achieve is to trigger an alert if the ratio of produced to consumed messages drops below 1.
Unfortunately, it is impossible to use Prometheus, for two reasons:
1) the counter resets each time the service is restarted.
2) the second service doesn't have (and won't have in a reasonable time frame) Prometheus integration.
The question is how to approach metrics and alerting based on these data sources. Is it possible? Maybe there is another way to achieve my goal?
The question is somewhat generic (meaning no mapping or code, in general, is provided), so I'll provide an approach.
You can use a watcher on top of an aggregation that you create.
It's relatively straightforward to compute a consumed/produced percentage, and based upon that percentage you can trigger an alert via the watcher.
Take a look at this tutorial (official Elasticsearch channel) on how to do this. Moreover, check the tutorials for your specific version of Elasticsearch. From 5.x to 7.x, setting up alerts has been significantly improved (this means that for 7.x you might be able to do this via the Kibana UI, but for 5.x you'll probably need to add the alert by indexing JSON into the appropriate .watches index).
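A hedged sketch of such a watch for 7.x (the index pattern, the hypothetical event_type field used to tell the two document types apart, and the 5-minute window are assumptions, not taken from the question):

```
PUT _watcher/watch/produced_vs_consumed
{
  "trigger": { "schedule": { "interval": "5m" } },
  "input": {
    "search": {
      "request": {
        "indices": ["service-logs-*"],
        "body": {
          "size": 0,
          "query": { "range": { "@timestamp": { "gte": "now-5m" } } },
          "aggs": {
            "produced": { "filter": { "term": { "event_type": "produced" } } },
            "consumed": { "filter": { "term": { "event_type": "consumed" } } }
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": "def p = ctx.payload.aggregations.produced.doc_count; def c = ctx.payload.aggregations.consumed.doc_count; return c > 0 && ((double) p / c) < 1"
    }
  },
  "actions": {
    "log_alert": {
      "logging": { "text": "Produced/consumed ratio dropped below 1 in the last 5 minutes" }
    }
  }
}
```

The logging action is just a placeholder; email, webhook, or Slack actions can be swapped in the same way.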
I haven't used Grafana, but I believe the same approach can be applied. You'll need an aggregation, as mentioned before, and then you add the alert: https://grafana.com/docs/grafana/latest/alerting/rules/

Re-processing data for Elasticsearch with a new pipeline

I have an ELK-stack server that is being used to analyse Apache web log data. We're loading ALL of the logs, going back several years. The purpose is to look at some application-specific trends over this time period.
The data-processing pipeline is still being tweaked, as this is the first time anyone has looked in detail into this data and some people are still trying to decide how they want the data to be processed.
Some changes were suggested, and while they're easy enough to make in the Logstash pipeline for new, incoming data, I'm not sure how to apply these changes to the data that's already in Elasticsearch. It took several days to load the current data set, and quite a bit more data has been added since, so re-processing everything through Logstash with the modified pipeline will probably take several days longer.
What's the best way to apply these changes to data that has already been ingested into Elasticsearch? In the early stages of testing this set-up, I would just remove the index and rebuild from scratch, but that was done with very limited data sets; with the amount of data in use here, I'm not sure that's feasible. Is there a better way?
Set up an ingest pipeline and use the Reindex API to move data from the current index to a new index (with the pipeline configured on the destination index).
Ingest Node
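A rough sketch of that, with placeholder pipeline, processor, and index names (the actual parsing changes would go into the processors list):

```
PUT _ingest/pipeline/weblog-rework
{
  "description": "Re-apply the updated parsing to already-indexed documents",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{COMBINEDAPACHELOG}"]
      }
    }
  ]
}

POST _reindex?wait_for_completion=false
{
  "source": { "index": "weblogs-old" },
  "dest":   { "index": "weblogs-new", "pipeline": "weblog-rework" }
}
```

Running the reindex with wait_for_completion=false returns a task ID, so a multi-day reindex can run in the background and be monitored via the Tasks API.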

Is it possible to know when some data is available for being searched in Elasticsearch?

I'm implementing a piece of software in which data is sent to a web server, stored in Elasticsearch, and then queried right away. I know that Elasticsearch is a NoSQL store following BASE (Basically Available, Soft state, Eventual consistency) principles, which means there's no guarantee of when your data will be available for searching.
That's why, when I query for data that has just been added to Elasticsearch, I have to wait for some time before it is found. Right now all I can do is implement a polling mechanism to detect when the data is fully indexed. It is worth mentioning that if I'm using the _id to retrieve a document, it is found right away. But if I'm searching for it using some type of Elasticsearch query (like term or query_string), it takes a while before the document is found.
So my question is: Is there a cheaper way to detect when data is completely indexed in Elasticsearch?
This part is handled by the Refresh API; that API does not provide a way to know when the indexed data is available, but the folks at Elastic have been working on a way to let a request wait for a refresh.
I think it would be best if you take a look here: https://www.elastic.co/blog/refreshing_news
This post has a good overview of the issues and of the things they have been working to improve.
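If you are on Elasticsearch 5.0 or later, that behaviour is available as the refresh=wait_for parameter on write requests: the call does not return until the change is visible to search. A minimal sketch with a made-up index and query:

```
PUT my-index/_doc/1?refresh=wait_for
{
  "user": "alice",
  "status": "active"
}

GET my-index/_search
{
  "query": { "match": { "status": "active" } }
}
```

refresh=true forces an immediate refresh instead, which also makes the document searchable right away but is more expensive under heavy indexing.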
Hope it helps :D

Resources