Elasticsearch data comparison - bash

I have two different Elasticsearch clusters,
One cluster is Elastcisearch 6.x with the data, Second new Elasticsearch cluster 7.7.1 with pre-created indexes.
I reindexed data from Elastcisearch 6.x to Elastcisearch 7.7.1
Is there any way to get the doc from source and compare it with the target doc, in order to check that data is there and it is not affected somehow.

When you perform a reindex the data will be indexed based on destination index mapping, so if your mapping is same you should get the same result in search, the _source value will be unique on both indices but it doesn't mean your search result will be the same. If you really want to be sure everything is OK you should check the inverted index generated by both indices and compare them for fulltext search, this data can be really big and there is not an easy way to retrieve it, you can check this for getting term-document matrix .

Related

ElasticSearch - search by property from concrete index while going through multiple indexes

We're using ElasticSearch and we have two different indexes with different data. Recently, we wanted to make a query that needs data from both indexes. ES allows to search through multiple indexes: /index1,index2/_search. The problem is that both indexes have properties with the same name and there could be collisions because ES doesn't know on which index to search.
How can we tell ES to look up a property from concrete index?
For example: index1.myProperty and index2.otherProperty

ElasticSearch as primary DB for document library

My task is a full-text search system for a really large amount of documents. Now I have documents as RTF file and their metadata, so all this will be indexed in elastic search. These documents are unchangeable (they can be only deleted) and I don't really expect many new documents per day. So is it a good idea to use elastic as primary DB in this case?
Maybe I'll store the RTF file separately, but I really don't see the point of storing all this data somewhere else.
This question was solved here. So it's a good case for elasticsearch as the primary DB
Elastic is more known as distributed full text search engine , not as database...
If you preserve the document _source it can be used as database since almost any time you decide to apply document changes or mapping changes you need to re-index the documents in the index(known as table in relation world) , there is no possibility to update parts of the elastic lucene inverse index , you need to re-index the whole document ...
Elastic index survival mechanism is one of the best , meaning that if you loose node the index lost replicas are automatically replicated to some of the other available nodes so you dont need to do any manual operations ...
If you do regular backups and having no requirement the data to be 24/7 available it is completely acceptable to hold the data and full text index in elasticsearch as like in database ...
But if you need highly available combination I would recommend keeping the documents in mongoDB (known as best for distributed document store) for example and use elasticsearch only in its original purpose as full text search engine ...

How to upsert documents in an existing elasticsearch index?

I have an elasticsearch index which has multiple documents, now I want to update the index with some new documents which might also contain duplicates of the existing documents. What's the best way to do this? I'm using elasticsearch py for all CRUD operations
Every update in elasticsearch deletes the old document and create a new document as the smallest unit of document collection is called segments in elastic-search which are immutable, hence when you index a new document or update any exiting documents, it gets into the new segments which are merged into bigger segments during the merge process.
Now even if you have duplicate data but with the same id, it will replace the existing document, and its fine and more performant than first fetching the document and than comparing both the documents to see if they are duplicate and than discard the update/upsert request from an application, rather than just index whatever if coming and ES will again insert the duplicate docs.

Can we migrate non stored Index data in SOLR to Elastic search?

We are currently using SOLR for full-text search. Now we are planning to move from SOLR to ElasticSearch. When we were in this process i have read somewhere that there are some plugins available which will migrate data from SOLR-ElasticSearch. But it won't be able to migrate those records which are not stored in SOLR. So is there a plugin available which will migrate non-stored index data from SOLR to elastic search if so please let me know.
Currently am using SOLR-to-ES plugin, but it won't migrate the non-stored index data.
Thanks
If the field is not stored, then you don't have the original value. If you have it indexed, what's is in there is the value after it has gone through the analysis chain, and so is probably different than the original one (has no stopwords, is probably lowercased, maybe stemmed...stuff like that).
There are a couple of possibilities that might allow you to have the original content when not stored:
indexed field: if it has been analyzed with just the keyword tokenizer: then the indexed value is the original value.
field has docValues=true then the original value is also stored. This feature was introduced later, so your index might not be using it.
The issue is, the common plugings might not take advantage of those cases where stored=true is not totally necessary. You need to check them.

Elasticsearch > Is it possible to build indices on base of FIELDS

In the context of ELK (Elasticsearch, Logstash, Kibana), I learnt that Logstash has FILTER to make use of grok to divide log messages into different fields. According to my understanding, it only helps to make the unstructured log data into more structured data. But I do no have any idea about how Elasticsearch can make use of the fields (done by grok) to improve the querying performance? Is it possible to build indices on base of the fields like in traditional relational database?
From Elasticsearch: The Definitive Guide
Inverted index
Relational databases add an index, such as a B-tree index, to specific columns in
order to improve the speed of data retrieval. Elasticsearch and Lucene use a
structure called an inverted index for exactly the same purpose.
By default, every field in a document is indexed (has an inverted
index) and thus is searchable. A field without an inverted index is
not searchable. We discuss inverted indexes in more detail in Inverted Index.
So you not need to do anything special. Elasticsearch already indexes all the fields by default.

Resources