Design: Generic Elasticsearch Index storing all Kafka Topics messages

Hi, I am new to the Elastic Stack. This is basically a design question. We have a lot of Kafka topics (>500), and each of them uses JSON as its data exchange format. We are planning to build a Kafka consumer and dump all the records (JSON documents) into a single index. We have several requirements, but the most important one to begin with is being able to search across all the relevant JSON documents based on a few important field values. For example, if multiple JSON documents have a correlation id field with the value XYZ, then entering XYZ should search across the data from all topics.
As an additional question: since we are using Kibana, is there a built-in visualization for this kind of search, so that we don't need to build our own UI? It is simply for management to search for specific values and need not be a very fancy UI.
What would be the best thing to do? Is a single index the best design? What do we need to consider? I read about the standard analyzer and am wondering whether it is enough for our purpose.
Assumption: all Kafka topics store JSON, and each JSON document can have a different format. Some might have lots of nesting, some might have nested objects. Some might be simple.
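To make the cross-topic lookup concrete, one possible shape is to wrap every Kafka record in a common envelope before indexing and to copy the few important search fields to fixed top-level names. The following is only a minimal sketch under stated assumptions, not a recommended final design: the envelope field names (`topic`, `partition`, `offset`, `correlation_id`, `payload`), the candidate key spellings, and the idea that `correlation_id` is mapped as a keyword field are all hypothetical.

```python
import json
from typing import Any, Optional

# Hypothetical spellings of the correlation id across topics (an assumption).
CORRELATION_KEYS = {"correlationid", "correlation_id", "correlation-id"}


def find_correlation_id(value: Any) -> Optional[str]:
    """Return the first correlation id found in nested JSON, or None."""
    if isinstance(value, dict):
        # Prefer a match at the current level before descending.
        for key, child in value.items():
            if key.lower() in CORRELATION_KEYS and isinstance(child, (str, int)):
                return str(child)
        children = list(value.values())
    elif isinstance(value, list):
        children = value
    else:
        return None
    for child in children:
        found = find_correlation_id(child)
        if found is not None:
            return found
    return None


def to_envelope(topic: str, partition: int, offset: int, raw: str) -> dict:
    """Wrap one Kafka record (raw JSON string) in a common document shape."""
    payload = json.loads(raw)
    return {
        "topic": topic,
        "partition": partition,
        "offset": offset,
        "correlation_id": find_correlation_id(payload),
        "payload": payload,
    }


def correlation_query(value: str) -> dict:
    """Search body for an exact match on the hypothetical keyword field."""
    return {"query": {"term": {"correlation_id": value}}}
```

With this shape, a search for XYZ does not depend on how each topic nests its data. One thing to keep in mind is that indexing arbitrary payloads under a single mapping can run into field type conflicts and a steadily growing number of mapped fields.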

Related

OpenSearch Indexes and Data streams

I have recently started to use OpenSearch and have a few newbie questions.
What is the difference between an index, an index pattern and an index template? (Some examples would be really helpful to visualize and differentiate these terms.)
I have seen some indexes with data streams and some without. What exactly are data streams, and why do some indexes have them and others do not?
I tried reading a few docs and watching a few YouTube videos, but it's getting a little confusing as I do not have much hands-on experience with OpenSearch.

(1)
An index is a collection of JSON documents that you want to make searchable. To maximise your ability to search and analyse documents, you can define how documents and their fields are stored and indexed (i.e., mappings and settings).
An index template is a way to initialize new indices that match a given name pattern with predefined mappings and settings - e.g., any new index with a name starting with "java-" (docs).
An index pattern is a concept associated with Dashboards, the OpenSearch UI. It provides Dashboards with a way to identify which indices you want to analyse, based on their name (again, usually based on prefixes).
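As a rough illustration of the three terms, the sketch below builds an index template body of the kind sent to `PUT _index_template/<name>` and then imitates the name matching locally. The template contents, the field names and the index names are made up for the example, and the wildcard matching with `fnmatchcase` is only an approximation of what the server and Dashboards do.

```python
from fnmatch import fnmatchcase
from typing import List

# Index template: settings and mappings applied to NEW indices whose names
# match "java-*". Field names here are hypothetical.
java_template = {
    "index_patterns": ["java-*"],
    "template": {
        "settings": {"number_of_shards": 1},
        "mappings": {
            "properties": {
                "message": {"type": "text"},
                "level": {"type": "keyword"},
            }
        },
    },
}


def template_applies(template: dict, index_name: str) -> bool:
    """Approximate whether a template would apply to a newly created index."""
    return any(fnmatchcase(index_name, p) for p in template["index_patterns"])


def indices_for_pattern(pattern: str, existing: List[str]) -> List[str]:
    """Approximate which existing indices a Dashboards index pattern selects."""
    return [name for name in existing if fnmatchcase(name, pattern)]


# Indices: the concrete collections of documents (names are hypothetical).
existing_indices = ["java-2024.01", "java-2024.02", "python-2024.01"]
```

Here `template_applies(java_template, "java-2024.03")` is true, so a new index with that name would start with the template's mappings, while `indices_for_pattern("java-*", existing_indices)` selects the two existing `java-` indices for analysis.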
(2)
Data streams are managed indices highly optimised for time-series and append-only data, typically, observability data. Under the hood, they work like any other index, but OpenSearch simplifies some management operations (e.g., rollovers) and stores in a more efficient way the continuous stream of data that characterises this scenario.
In general, if you have a continuous stream of data that is append-only and has a timestamp attached (e.g., logs, metrics, traces, ...), then data streams are advertised as the most efficient way to model this data in OpenSearch.
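A data stream is declared through an index template that carries a `data_stream` object. The sketch below shows such a template body and an append-only event with the default `@timestamp` field. The template pattern, the priority value and the `message` field are assumptions chosen for illustration.

```python
from datetime import datetime, timezone
from typing import Optional

# Body of the kind sent to PUT _index_template/<name>. The empty
# "data_stream" object marks matching names as data streams.
logs_template = {
    "index_patterns": ["logs-*"],
    "data_stream": {},
    "priority": 100,
}


def make_log_event(message: str, when: Optional[datetime] = None) -> dict:
    """Build an append-only event; data streams expect a timestamp field."""
    when = when or datetime.now(timezone.utc)
    return {"@timestamp": when.isoformat(), "message": message}
```

Each event is appended to the stream rather than updated in place, and the backing indices behind the stream are rolled over by OpenSearch instead of being managed by hand.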

Elasticsearch + Logstash - Indexing data from different sources when receiving an event

Good day,
I have gotten into a bit of a headache when working on indexing some data in Elasticsearch and have some questions about a good approach.
As of now, an event is received on a Kafka topic with just a part of the data that should be stored in the document. The rest of the data needs to be collected after the event is received and is available from different APIs. To reduce the amount of work, it seems that Logstash could be a good approach.
Is there a way to configure Logstash to initiate data collection from different APIs and DBs when an event is received, and then assemble the document with the combined data, or am I stuck with writing time-consuming custom logic for the indexing? I have searched around a bit, but couldn't find a good answer to the problem.

What you need in Logstash is to look up/enrich your message with data from external APIs, right?
You could use Logstash's http filter plugin (logstash-filter-http).

Using elastic search for a UI dashboard behind a proxy

I am working on a search dashboard with full-text search capabilities, backed by ES. The search would initially be consumed by a UI dashboard. I am planning to have an application web service (WS) API layer between the UI dashboard and ES, which will route the business search to ES.
There can be multiple clients to WS going forward, each with its own business use cases, and complex data requirements (basically response fields). There are many entities and a huge number of fields across them. Each client would need to specify which entities it wants returned and with which fields.
To support this dynamically changing requirement, one approach could be to make the WS a pass-through to ES (with pre-validation such as access control, and post-transformation of the response from ES). The WS APIs would look exactly like the ES APIs; the UI would build ES queries through a JS client and send them to the WS, which, after access control, would get the data from ES.
I am new to ES and skeptical of this approach. Are there any particular challenges with it? One of my colleagues has worked with ES before, but always with a backend Java client, so he's not too sure.
I looked up an ES JS client and there's an official one here.
Some Context here:
We have around 4 different entities (this can increase in future) with both full-text and keyword type fields. A typical search could have multiple filters and search terms and would want to specify the result fields. Also, some searches would be across entities and some against individual ones. We are maintaining a separate index for each entity.

From your post, I understand that this is what you want to achieve at a high level:
There can be multiple clients to WS going forward, each with its own
business use cases, and complex data requirements (basically response
fields)
Since you are not sure how to do this, you are thinking of building the Elasticsearch queries in JavaScript in your front end. I am not a big fan of this approach, as it exposes how you build your queries, and a hacker who learns crucial information like the following can bring your entire ES cluster to its knees:
What types of wildcard queries you use.
Index names and ES cluster details (you may have access control, but you are still exposing crucial info).
How you build your search queries.
These are just a few examples; I will add more info.
Right approach
As you already have a backend where you check access, build the Elasticsearch queries there; you also have the advantage of a teammate who knows it.
For building complex response fields, you can use source filtering, which lets you specify in your search request which fields you want returned in the search results.
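A minimal sketch of that backend step: the client only names the fields it wants, and the backend checks them against an allow-list before building the search body with `_source`. The client name, the allow-list contents and the `title` field used in the `match` query are hypothetical.

```python
from typing import Dict, Iterable, Set

# Hypothetical per-client allow-lists of response fields (an assumption).
ALLOWED_FIELDS: Dict[str, Set[str]] = {
    "dashboard": {"title", "owner.name", "created_at"},
}


def build_search(client: str, text: str, requested: Iterable[str]) -> dict:
    """Build the search body on the backend, keeping only permitted fields."""
    allowed = ALLOWED_FIELDS.get(client, set())
    fields = sorted(f for f in requested if f in allowed)
    if not fields:
        raise ValueError("no permitted fields requested")
    return {
        # Source filtering: only these fields come back in each hit.
        "_source": fields,
        "query": {"match": {"title": text}},
    }
```

For example, `build_search("dashboard", "report", ["title", "password", "owner.name"])` silently drops `password` and returns a body whose `_source` is `["owner.name", "title"]`, so the front end never needs to know index names or query structure.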

Best way to set up ElasticSearch for searching in each customer's data only

We have a SAAS product where companies create accounts and populate their own private data. We are thinking about using ElasticSearch to allow the customer to search all their own data in our system.
As an example we would have a free text search where the user can type anything and the API would return multiple different types of objects. E.g. they type John and the API returns the user object for users matching a first name containing John, or an email containing John. Or it might also return a team object where the team name matches John (e.g. John's Team) etc.
So my questions are:
Is ElasticSearch a sensible choice for what we want to do from a concept perspective?
If we did use ElasticSearch what would be the best way to index the data so we can search all data for a particular customer? Does each customer have its own index?
Are there any hints on how we keep ElasticSearch in sync with the data in the database (DynamoDB)? If we index the data for a customer and then update the data as it changes is it sensible to then also reindex the data on a scheduled basis too?
Thanks!

I will try to provide general answers from my own experience with split customer data in Elasticsearch:
If you want to search through a lot of data really fast, ES is always a really good solution for this. It comes at the cost of a secondary data store that you will have to keep in sync with your database.
You can't have different data types in one index, so the options are either to create one index per data type and customer (careful, indices come with an overhead, so avoid creating too many with little data in them) or to create one index per data type and add a property to your data that you can then filter on, e.g. a customer number.
You will have to denormalize your data as much as possible to benefit from Elasticsearch.
As mentioned in 1, you will need to keep both in sync, and there are plenty of ways to do that. As an example, we use an event-driven approach to push critical updates into Elasticsearch as soon as possible (careful: it's not SQL, so you will always have some concurrency issues when you need read and write safety). For data that is not highly critical, we use jobs that update it regularly. When you index a document with an id that already exists, the existing document is completely overwritten.
Hope this helps, feel free to ask questions.
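The shared-index option and the sync behaviour can be sketched as follows. This is an illustration under assumptions, not our production code: the field names (`customer_id`, `first_name`, `email`), the `id` key on each record and the id format are hypothetical. The action shape with `_index`, `_id` and `_source` is the one accepted by the bulk helpers of the Python client.

```python
def customer_search(customer_id: str, text: str) -> dict:
    """Free-text search that can only match one customer's documents."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"multi_match": {"query": text,
                                     "fields": ["first_name", "email"]}}
                ],
                # The filter clause restricts hits without affecting scoring.
                "filter": [{"term": {"customer_id": customer_id}}],
            }
        }
    }


def document_id(customer_id: str, record_id: str) -> str:
    """Deterministic id, so re-indexing a record overwrites the old copy."""
    return f"{customer_id}:{record_id}"


def index_action(index: str, customer_id: str, record: dict) -> dict:
    """One bulk action for a database record (record must carry an 'id')."""
    return {
        "_index": index,
        "_id": document_id(customer_id, str(record["id"])),
        "_source": {**record, "customer_id": customer_id},
    }
```

Because the id is derived from the database key, both the event-driven updates and the scheduled jobs can send the same record repeatedly without creating duplicates.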

Working with NLP tags in Elasticsearch

Working on a large data-oriented search product powered by elasticsearch. We've built a lot of machine learning functionality on top of this app, but currently we're having some difficulty deciding how to integrate fairly standard NLP-based word tags into our ES index.
Currently we have a tagging service that can annotate a word with a respective type (or types, but one may be useful enough for now). This function could be abstracted to: type = getWordType(word). I imagine there must be a way to integrate this tagging service into the analysis chain that is applied at index time, where, maybe, we tell the index what type a particular word belongs to. However, doing this kind of advanced analysis is a bit beyond my Elasticsearch capacity. Does anyone have pointers on this kind of advanced analysis in Elasticsearch?
Thanks!

You might want to take a look at the ingest node functionality introduced in Elasticsearch 5.0. This allows you to preprocess your documents and add fields to the JSON before the document is indexed in Elasticsearch.
I wrote an ingest processor that uses OpenNLP to enrich documents. You could take a look at that one and adapt it to your needs (also, pull requests are very welcome).
Check it out at https://github.com/spinscale/elasticsearch-ingest-opennlp

This is achieved in Elasticsearch 6.5 with the type annotated_text: https://www.elastic.co/guide/en/elasticsearch/plugins/6.x/mapper-annotated-text-usage.html
Essentially, kind of like synonyms, the tags (or named entity IDs, etc) can exist at the same position as the word you’re tagging.
It requires a plugin to be installed: the Mapper Annotated Text plugin.
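A rough sketch of how the output of a tagging function could be turned into the markdown-like markup that an annotated_text field accepts, where each tagged word is written as `[word](value)` with a URL-encoded value. The `annotate` helper and the example tags are hypothetical, `get_word_type` stands in for the getWordType service from the question, and words containing square brackets are not handled.

```python
import re
from typing import Callable, Optional
from urllib.parse import quote


def annotate(text: str, get_word_type: Callable[[str], Optional[str]]) -> str:
    """Wrap every tagged word as [word](type); leave untagged words alone."""

    def replace(match: "re.Match[str]") -> str:
        word = match.group(0)
        word_type = get_word_type(word)
        if word_type is None:
            return word
        # Annotation values are URL-encoded inside the parentheses.
        return f"[{word}]({quote(word_type, safe='')})"

    return re.sub(r"\w+", replace, text)


# Stand-in for the real tagging service (an assumption for the example).
example_tags = {"John": "PERSON", "Berlin": "LOCATION"}
annotated = annotate("John lives in Berlin.", example_tags.get)
```

The resulting string is what would be sent as the value of the annotated_text field, so the tag sits at the same position as the word it describes.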
