How to add only new docs or changed docs in Elasticsearch?

Scenario: Script pulls data from an external API, formats the results as a dictionary/json object, and pushes the data to elasticsearch. The script is scheduled to run periodically.
Conditions: The script should only push the dictionaries for records that do not already exist in elasticsearch. And for records that exist in elasticsearch, update fields if any data has been changed.
My Approach: The records from the API have an ID which I use to check if they exist in elasticsearch by doing a search query. I make a list of IDs that do not exist in elasticsearch and push the corresponding records to elasticsearch.
Issue: For example, say the record {'ID':1, 'Status':'Started'} was pushed to Elasticsearch yesterday. Now that the data has changed to {'ID':1, 'Status':'Completed'}, the record will still be ignored because I am checking only the ID.
Solution I am thinking of: Before inserting into Elasticsearch, compare all the fields of the json object/dictionary. If everything matches, skip the insertion. If any field has a different value, insert into Elasticsearch. [Having multiple docs for the same record is not an issue. Having multiple docs for the same record with all the same values is what needs to be avoided.]

You can pass the document ID to the index method. This will insert the record if it doesn't exist, or update any fields that are different. This way you don't need the custom logic of managing that ID as a regular field.
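As a minimal sketch with the official Python client (8.x-style arguments; the index name records and the local connection URL are assumptions, and older clients take body= instead of document=):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

record = {"ID": 1, "Status": "Completed"}

# Using the record's own ID as the Elasticsearch _id makes the call idempotent:
# the first run creates the document, later runs overwrite it in place, so a
# changed Status is picked up automatically and no duplicate docs are created.
es.index(index="records", id=record["ID"], document=record)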

Related

Fetching Index data from elasticsearch DB using Spring Data ElasticSearch

I have Java code which connects to an Elasticsearch DB using Spring-data-elasticsearch and fetches all the index data by connecting to the repository and executing the findAll() method. The data received from ES is being processed by a separate application. When new data is inserted into Elasticsearch, I have the below queries:
1. How can I fetch only the newly inserted data programmatically?
2. Apart from using the DSL queries, is there a way to asynchronously get the new records as and when new data is inserted into the Elasticsearch DB?
I don't want to execute the findAll() method again, because it returns the entire data (including the previously processed records as well).
Any help on this is much appreciated.
You will need to add a field (I call it createdAt here) to your entities that contains the timestamp when your application inserts into Elasticsearch. One possibility would be to use the auditing support of Spring Data Elasticsearch to have the value set automatically, or you can set the value in your application. If the data is inserted by some other application, you need to make sure that it contains a timestamp in a format that matches the field type definition of this field in your application.
Then you'd need to define a method in your repository like
SearchHits<T> findByCreatedAtAfter(Timestamp referenceValue);
As for getting a notification in some form when new data is inserted: I'm not aware that Elasticsearch offers something like that. You will probably need to regularly call the method that retrieves the data.

Implement created_on and updated_on logic on client side

Since Elasticsearch 2 there is no _timestamp field, so we have to explicitly populate time fields like created_on and updated_on.
One way I know to populate these fields is to check whether the item already exists in the database, using a uid (assume the uid is generated on the client side from some item properties). If the item exists in the database, update all fields except created_on. If it does not exist, create an entry with the item and created_on set to the current time.
My questions are:
* Isn't checking every time I create/update redundant?
* Is there a better way to implement created_on and updated_on logic on the client side without that redundancy (i.e. without querying Elasticsearch)?
Using a "middleware" for this is a good way to avoid having this kind of logic in the client, once you change the design, you would need to perform changes on every client implementation, so I think is a good use case for ingesting pipelines and there is an example in the doc.
Accessing Ingest Metadata Fields:
Beyond metadata fields and source fields, ingest also adds ingest metadata to the documents that it processes. These metadata properties are accessible under the _ingest key. Currently ingest adds the ingest timestamp under the _ingest.timestamp key of the ingest metadata. The ingest timestamp is the time when Elasticsearch received the index or bulk request to pre-process the document.
If you need more intelligent middleware, mind the Script Processor which allows inline and stored scripts to be executed within ingest pipelines.
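As a minimal sketch of that idea, assuming the official Python client and illustrative pipeline/field names (set-timestamps, created_on, updated_on), a pipeline could stamp documents on ingest:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Pipeline that stamps every document with the ingest timestamp.
es.ingest.put_pipeline(
    id="set-timestamps",
    processors=[
        {"set": {"field": "updated_on", "value": "{{_ingest.timestamp}}"}},
        # override:false only sets created_on when the incoming document does
        # not already carry one, so re-imports that resend it keep the original.
        {"set": {"field": "created_on", "value": "{{_ingest.timestamp}}", "override": False}},
    ],
)

# The client then indexes through the pipeline with no timestamp logic of its own.
es.index(index="items", id="item-1", document={"name": "foo"}, pipeline="set-timestamps")

One caveat: a plain re-index replaces the whole document, so created_on is only preserved if the client resends it, and the _update API bypasses ingest pipelines entirely.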

Elastic search for batch update of old documents

In my application, I am using Elasticsearch for indexing and searching of documents. As expected, documents have some fields.
Due to new requirements, users want those documents to have some more new fields. I can add the new fields to newly created documents, but I also need the old documents to have these fields.
I am thinking of writing a framework which would accept generic criteria to read old documents and update them. By generic criteria, I mean it must be able to accept any user-defined condition for reading older documents.
I am new to ES, and hence not sure if it's feasible.
So I want to know whether it is feasible to write such a framework using Elasticsearch?
If you provide a custom document ID, you can reindex your existing data with the update API (also available in upsert mode). In this way you can update the documents, adding the new fields, when you re-import the old data.
It is important to provide a document ID; otherwise it is impossible to add fields to the existing documents, since only inserts are possible.
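A minimal sketch with the official Python client (the index name documents and the new field values are assumptions; clients before 8.x take a body= dict instead of doc=/doc_as_upsert=):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

new_fields = {"category": "legacy", "reviewed": False}

# Partial update: merges new_fields into the stored document if it exists,
# or indexes new_fields as a brand-new document (upsert) if it does not.
es.update(index="documents", id="42", doc=new_fields, doc_as_upsert=True)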

Is it possible for an Elasticsearch index to have a primary key comprised of multiple fields?

I have a multi-tenant system, whereby each tenant gets their own Mongo database within a MongoDB deployment.
However for elastic search indexing, this all goes into one elastic instance via Mongoosastic, tagged with a TenantDB to keep data separated when searching.
Currently we have some of the same _ids reused across the multiple databases in test data for various config collections (different document content, same _id). This is causing a problem when syncing to Elastic: although the documents are in separate databases, when they arrive in Elastic with the same type and ID, one of them gets dropped.
Is it possible to specify both the ID and TenantDB as the primary key?
Solution 1: You can search across multiple indices in Elasticsearch, but if you cannot separate your indices per database, you can use the following method. While syncing your data to Elasticsearch, use a pattern to create the elastic document _id. For example, from mongoDb1 use mdb1_{mongo_id}, from mongoDb2 use mdb2_{mongo_id}, etc. This keeps your _ids unique as long as the same ID does not repeat within a single Mongo database.
Solution 2: Separate your indices.
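A small sketch of Solution 1, assuming the official Python client and an illustrative index name configs:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def index_tenant_doc(tenant_db: str, mongo_id: str, doc: dict) -> None:
    # Prefixing the Mongo _id with the tenant database name means documents
    # from different tenants can never collide on the Elasticsearch _id.
    doc["TenantDB"] = tenant_db  # keep the tenant tag searchable, as in the question
    es.index(index="configs", id=f"{tenant_db}_{mongo_id}", document=doc)

# The same Mongo _id from two tenant databases now maps to two distinct docs:
index_tenant_doc("mdb1", "5f1a2b3c", {"setting": "a"})
index_tenant_doc("mdb2", "5f1a2b3c", {"setting": "b"})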

Elasticsearch index with historical versions of documents

I have an Elasticsearch index continuously being updated and I'm creating a second index with the same mappings for doing offline analytics: I need to store changes for certain fields, in order to retrieve the values that were associated in specific time in the past. Therefore, in this second index I store multiple versions of the same document (same id but different _id fields).
My objective is to get ranked results for a given query and reference date. I've tried aggregations, but rather than filtering the hits section, you get a separate aggregations section with unordered results.
Is there any way other than removing duplicates at the client side?
This is similar to, but different from, this previous question, as the solution proposed there (just having a boolean current field) only allows removing duplicates when querying the present.
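For concreteness, the write path of the versioning scheme described above might look like this (the index name, field names, and the Python client are assumptions):

from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Each version of a logical document gets its own auto-generated _id, while
# the logical id and a valid_from timestamp are regular fields, so the state
# at any reference date in the past can be reconstructed by filtering on them.
doc = {
    "id": "doc-1",
    "status": "Completed",
    "valid_from": datetime.now(timezone.utc).isoformat(),
}
es.index(index="analytics", document=doc)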
