OpenSearch Indexes and Data streams - elasticsearch

I have recently started to use OpenSearch and having a few newbie questions.
What is the difference between Index, Index Pattern and Index template? (Some examples would be really helpful to visualize and differentiate these terminologies).
I have seen some indexes with data streams and some without data streams. What exactly are data streams and why some indexes have them and the others do not.
Tried reading a few docs, watching a few youTube videos. But it's getting a little confusing as I do not have much hands on experience with OpenSearch.

(1)
An index is a collection of JSON documents that you want to make searchable. To maximise your ability to search and analyse documents, you can define how documents and their fields are stored and indexed (i.e., mappings and settings).
An index template is a way to initialize with predefined mappings and settings new indices that match a given name pattern - e.g., any new index with a name starting with "java-" (docs).
An index pattern is a concept associated with Dashboards, the OpenSearch UI. It provides Dashboards with a way to identify which indices you want to analyse, based on their name (again, usually based on prefixes).
(2)
Data streams are managed indices highly optimised for time-series and append-only data, typically, observability data. Under the hood, they work like any other index, but OpenSearch simplifies some management operations (e.g., rollovers) and stores in a more efficient way the continuous stream of data that characterises this scenario.
In general, if you have a continuous stream of data that is append-only and has a timestamp attached (e.g., logs, metrics, traces, ...), then data streams are advertised as the most efficient way to model this data in OpenSearch.

Related

Filter result in memory to search in elasticsearch from multiple indexes

I have 2 indexes and they both have one common field (basically relationship).
Now as elastic search is not giving filters from multiple indexes, should we store them in memory in variable and filter them in node.js (which basically means that my application itself is working as a database server now).
We previously were using MongoDB which is also a NoSQL DB but we were able to manage it through aggregate queries but seems the elastic search is not providing that.
So even if we use both databases combined, we have to store results of them somewhere to further filter data from them as we are giving users advanced search functionality where they are able to filter data from multiple collections.
So should we store results in memory to filter data further? We are currently giving advanced search in 100 million records to customers but that was not having the advanced text search that elastic search provides, now we are planning to provide elastic search text search to customers.
What do you suggest should we use the approach here to make MongoDB and elastic search together? We are using node.js to serve data.
Or which option to choose from
Denormalizing: Flatten your data
Application-side joins: Run multiple queries on normalized data
Nested objects: Store arrays of objects
Parent-child relationships: Store multiple documents through joins
https://blog.mimacom.com/parent-child-elasticsearch/
https://spoon-elastic.com/all-elastic-search-post/simple-elastic-usage/denormalize-index-elasticsearch/
Storing things client side in memory is not the solution.
First of all the simplest way to solve this problem is to simply make one combined index. Its very trivial to do this. Just insert all the documents from index 2 into index 1. Prefix all fields coming from index-2 by some prefix like "idx2". That way you won't overwrite any similar fields. You can use an ingestion pipeline to do this, or just do it client side. You only will ever do this once.
After that you can perform aggregations on the single index, since you have all the data in one-index.
If you are using somehting other than ES as your primary data-store you need to reconfigure the indexing operation to redirect everything that was earlier going into index-2 to go into index-1 as well(with the prefixed terms).
100 million records is trivial for something like ELasticsearch. Doing anykind of "joins" client side is NOT RECOMMENDED, as this will obviate the entire value of using ES.
If you need any further help on executing this, feel free to contact me. I have 11 years exp in ES. And I have seen people struggle with "joins" for 99% of the time. :)
The first thing to do when coming from MySQL/PostGres or even Mongodb is to restructure the indices to suit the needs of data-querying. Never try to work with multiple indices, ES is not built for that.
HTH.

ElasticSearch as primary DB for document library

My task is a full-text search system for a really large amount of documents. Now I have documents as RTF file and their metadata, so all this will be indexed in elastic search. These documents are unchangeable (they can be only deleted) and I don't really expect many new documents per day. So is it a good idea to use elastic as primary DB in this case?
Maybe I'll store the RTF file separately, but I really don't see the point of storing all this data somewhere else.
This question was solved here. So it's a good case for elasticsearch as the primary DB
Elastic is more known as distributed full text search engine , not as database...
If you preserve the document _source it can be used as database since almost any time you decide to apply document changes or mapping changes you need to re-index the documents in the index(known as table in relation world) , there is no possibility to update parts of the elastic lucene inverse index , you need to re-index the whole document ...
Elastic index survival mechanism is one of the best , meaning that if you loose node the index lost replicas are automatically replicated to some of the other available nodes so you dont need to do any manual operations ...
If you do regular backups and having no requirement the data to be 24/7 available it is completely acceptable to hold the data and full text index in elasticsearch as like in database ...
But if you need highly available combination I would recommend keeping the documents in mongoDB (known as best for distributed document store) for example and use elasticsearch only in its original purpose as full text search engine ...

Is there any performance benefit to creating an index mapping for Elasticsearch

I was wondering for people who have used Elasticsearch at scale if there is a performance benefit while searching if I create an index mapping and then put documents in it compared to not creating a mapping and just directly putting documents in
It is usually preferable to create the explicit mapping for an index, where possible.
For a search case, this is crucial in order to index data with the analysis chains needed to service the search strategy.
For a log use case, it may not be possible to know what the explicit mapping should be for log records that will be ingested, as there may be dynamic fields in the data that is not known ahead of time. Dynamic templates can help here, as can adopting a unified logging structure like Elastic Common Schema (ECS), either converting data to ECS format whilst logging, or converting whilst ingesting into Elasticsearch with ingest pipelines
Yes it is always better to use explicit mapping before putting the documents rather than depending on the dynamic mapping. If at all you are dependent on the dynamic mapping you may not be able to visualize on few data types like text. And also when you maintain mapping your index will always have the same kind of data. Please refer to this blog:
[https://qbox.io/blog/maximize-guide-elasticsearch-indexing-performance-part-1/][1]

Deisgn: Generic Elastic Search Index storing all Kafka Topics messages

Hi I am new to Elastic stack. This is basically a design based question. We have lot of Kafka Topics (>500) and each of them store json as data exchange format. Now we are planning to build a Kafka Consumer and dump all the records/jsons into a Single Index. We have some requirements but to begin with the most important one being, able to search through all the relevant jsons based on few important field values. For example if I have multiple jsons having field correlation id with a value XYZ, then if I enter XYZ then it should be able to search through all the topics.
Also as an additional question, since we are using Kibana do we have some inbuilt visualization for this search thing so that we dont need to build our own UI? This is simply for management searching specific values and need not be very fancy UI.
What should be the best thing to do, is having a single index the best design? What all things we need to consider? I read about the standard Analyzer and am wondering if that is enough for our purpose.
Assumption- All Kafka topics will store jsons and each json can be of different formats. Some might have lots of nesting, some might have nested objects. Some might be simple.

How does ElasticSearch handle an index with 230m entries?

I was looking through elasticsearch and was noticing that you can create an index and bulk add items. I currently have a series of flat files with 220 million entries. I am working on Logstash to parse and add them to ElasticSearch, but I feel that it existing under 1 index would be rough to query. The row data is nothing more than 1-3 properties at most.
How does Elasticsearch function in this case? In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
I have been walking through the documentation, and it is explaining what to do, but not necessarily all the time explaining why it does what it does.
In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
That is exactly what you need to do. Typically it's an iterative process:
start by putting a subset of the data in. You can also put in all the data, if time and cost permit.
put some search load on it that is as close as possible to production conditions, e.g. by turning on whatever search integration you're planning to use. If you're planning to only issue queries manually, now's the time to try them and gauge their speed and the relevance of the results.
see if the queries are particularly slow and if their results are relevant enough. You change the index mappings or queries you're using to achieve faster results, and indeed add more nodes to your cluster.
Since you mention Logstash, there are a few things that may help further:
check out Filebeat for indexing the data on an ongoing basis. You may not need to do the work of reading the files and bulk indexing yourself.
if it's log or log-like data and you're mostly interested in more recent results, it could be a lot faster to split up the data by date & time (e.g. index-2019-08-11, index-2019-08-12, index-2019-08-13). See the Index Lifecycle Management feature for automating this.
try using the Keyword field type where appropriate in your mappings. It stops analysis on the field, preventing you from doing full-text searches inside the field and only allowing exact string matches. Useful for fields like a "tags" field or a "status" field with something like ["draft", "review", "published"] values.
Good luck!

Resources