We have a base database where Candidate records are created, updated, and deleted. In Elasticsearch we use the ES-generated ID, so the database does not know the ES ID generated for each record. In a batch process we fetch 500 records from the DB and send them to ES, but we do not know which records need to be inserted and which need to be updated. We also do not know the ES IDs of records that are already in ES.
Using a curl request to the Bulk API, is there any way to first check whether a record is present, using the unique email address of each record? If it is present, send an update request; if it is not present, send an insert request.
Is there any way we can write a script inside the bulk call?
Regards,
Jayesh Bhoyar
Related
I have Java code that connects to Elasticsearch using Spring Data Elasticsearch and fetches all the index data by calling the repository's findAll() method. The data received from ES is processed by a separate application. When new data is inserted into Elasticsearch, I have the following questions:
1. How can I fetch only the newly inserted data programmatically?
2. Apart from using DSL queries, is there a way to asynchronously get new records as and when they are inserted into Elasticsearch?
I don't want to execute the findAll() method again, because it returns the entire data set (including the previously processed records).
Any help on this is much appreciated.
You will need to add a field (I call it createdAt here) to your entities that contains the timestamp of when your application inserts the document into Elasticsearch. One possibility is to use the auditing support of Spring Data Elasticsearch to have the value set automatically; alternatively, set the value yourself in your application. If the data is inserted by some other application, you need to make sure that it contains a timestamp in a format that matches the field type definition of this field in your application.
Then you'd need to define a method in your repository like
SearchHits<T> findByCreatedAtAfter(Timestamp referenceValue);
As for getting a notification in some form when new data is inserted: I'm not aware that Elasticsearch offers something like that. You will probably need to regularly call the method that retrieves the data.
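The polling approach described above can be sketched outside Spring as well. This is a minimal illustration, not the Spring Data implementation: it assumes createdAt is mapped as a date field that accepts ISO 8601 strings, and `search` is a hypothetical callable that sends the body to Elasticsearch and returns the matching documents as a list of dicts.

```python
from datetime import datetime, timezone


def build_new_records_query(last_seen: datetime, field: str = "createdAt") -> dict:
    """Build a search body matching documents created strictly after last_seen."""
    return {
        "query": {"range": {field: {"gt": last_seen.isoformat()}}},
        "sort": [{field: "asc"}],
    }


def poll_once(search, last_seen: datetime, field: str = "createdAt"):
    """Run one polling pass and return (new documents, updated high-water mark).

    `search` is a hypothetical callable: it takes the query body and returns
    a list of document dicts, each carrying an ISO 8601 string in `field`.
    """
    docs = search(build_new_records_query(last_seen, field))
    if docs:
        # Advance the high-water mark to the newest timestamp seen so far.
        last_seen = max(datetime.fromisoformat(d[field]) for d in docs)
    return docs, last_seen
```

The caller keeps the returned high-water mark between runs (in memory or in a small state file) and passes it to the next poll, so already processed records are not fetched again.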
Scenario: Script pulls data from an external API, formats the results as a dictionary/json object, and pushes the data to elasticsearch. The script is scheduled to run periodically.
Conditions: The script should only push the dictionaries for records that do not already exist in elasticsearch. And for records that exist in elasticsearch, update fields if any data has been changed.
My Approach: The records from the API have an ID which I use to check if they exist in elasticsearch by doing a search query. I make a list of IDs that do not exist in elasticsearch and push the corresponding records to elasticsearch.
Issue: For example, if a record {'ID':1, 'Status':'Started'} was pushed to elasticsearch yesterday and the data has now changed to {'ID':1, 'Status':'Completed'}, it will still be ignored because I am checking only the ID.
Solution that I am thinking of: Insert into elasticsearch by comparing all the fields of the json object/dictionary. If everything matches, skip insertion. If any field has a different value, insert into elasticsearch. [Redundancy of having multiple docs for the same record is not an issue. Redundancy of having multiple docs for the same record with all the same values needs to be avoided.]
You can pass the document ID to the index method. This will insert the record if it doesn't exist, or replace the existing document with the same ID, so any fields that have changed are updated. This way you don't need to add custom logic to manage that ID as a regular field.
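If, as the question allows, keeping one document per distinct version of a record is acceptable, a variant of the same idea is to derive the document ID from the record's content instead of from its ID field. This is a sketch of that alternative, not something the answer above proposes; the function name content_id is made up for illustration.

```python
import hashlib
import json


def content_id(record: dict) -> str:
    """Derive a stable document ID from every field of the record.

    Keys are sorted so that the same data always hashes to the same ID,
    regardless of the order in which the API returned the fields.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


started = {"ID": 1, "Status": "Started"}
completed = {"ID": 1, "Status": "Completed"}

# Identical content maps to the same ID, so re-indexing it overwrites itself.
assert content_id(started) == content_id({"Status": "Started", "ID": 1})
# Changed content maps to a new ID, so it is stored as a separate document.
assert content_id(started) != content_id(completed)
```

Passing the record's own ID as the document ID keeps a single, always current document per record; passing content_id(record) keeps one document per distinct version without ever storing two identical copies.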
In Elasticsearch versions after 2.x there is no _timestamp field, so we have to explicitly populate time fields such as created_on and updated_on.
One way I know to populate these fields is to check whether the item is already in the database, using a uid (assume the uid is generated on the client side from some item properties). If the item exists in the database, update all fields except created_on. If the item does not exist, create an entry in the database with the item and created_on set to the current time.
My questions are:
* Isn't checking every time I create or update redundant?
* Is there a better way to implement the created_on and updated_on logic on the client side without that redundancy (without querying Elasticsearch)?
Using a "middleware" for this is a good way to keep this kind of logic out of the client; otherwise, whenever the design changes, you would need to change every client implementation. I think this is a good use case for ingest pipelines, and there is an example in the docs.
Accessing Ingest Metadata Fields:
Beyond metadata fields and source fields, ingest also adds ingest metadata to the documents that it processes. These metadata properties are accessible under the _ingest key. Currently ingest adds the ingest timestamp under the _ingest.timestamp key of the ingest metadata. The ingest timestamp is the time when Elasticsearch received the index or bulk request to pre-process the document.
If you need more intelligent middleware, consider the Script Processor, which allows inline and stored scripts to be executed within ingest pipelines.
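As a rough sketch of the ingest pipeline idea, the snippet below builds a pipeline definition that stamps each incoming document with _ingest.timestamp using a set processor. The field name updated_on comes from the question; the function name is made up for illustration. The resulting JSON would be sent with PUT _ingest/pipeline/<name> and then referenced through the pipeline parameter on index or bulk requests.

```python
import json


def timestamp_pipeline(field: str = "updated_on") -> dict:
    """Build an ingest pipeline body that writes the ingest time into `field`."""
    return {
        "description": "Stamp each document with the time Elasticsearch received it",
        "processors": [
            # The set processor copies the ingest metadata timestamp into the document.
            {"set": {"field": field, "value": "{{_ingest.timestamp}}"}}
        ],
    }


# Print the JSON body that would be sent when creating the pipeline.
print(json.dumps(timestamp_pipeline(), indent=2))
```

Note that a pipeline only sees the incoming document, not the version already stored, so this covers updated_on directly; preserving an existing created_on still needs something that knows the old value, such as a scripted update.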
I am uploading data to an ELK server through the JDBC input plugin in Logstash. After the first upload, every subsequent upload duplicates the data. I want to avoid that.
Second, if I use a document ID in the output section of Logstash and some rows are updated while the key I use as the document ID stays the same, will it update those fields?
My requirement is to avoid data duplication and also to update the fields that have changed, even when the primary key value is the same.
Any help will be appreciated! Thanks
I'm currently using elasticsearch and running a cron job every 10 minutes that finds newly created/updated data in my DB and syncs it with elasticsearch. However, I want to use bulk to sync instead of making an arbitrary number of requests to update/create documents in an index. I'm using the elasticsearch.js library provided by elasticsearch.
I face 2 challenges that I'm uncertain about how to handle:
How to use bulk to update a document if it exists and create a document if it doesn't within bulk without knowing if it exists in the index.
How to format a large amount of JSON to run through bulk to update/create the document because bulk api expects the body to be formatted a certain way.
The best option when streaming data in from an SQL database is Logstash's JDBC input (see its documentation), which can hopefully do it all for you.
Not all SQL schemas make this easy, so for your specific questions:
How to use bulk to update a document if it exists and create a document if it doesn't within bulk without knowing if it exists in the index.
Bulk currently accepts four different types of sub-requests, which behave differently than you probably expect coming from an SQL world:
index
create
update
delete
The first, index, is the most commonly used option. It means that you want to index (the verb) something to the Elasticsearch index (the noun). However, if it already exists in the index given the same _id, then it will replace it. The rest are probably a bit more obvious.
Each one of the sub-requests behaves like the individual request that it is associated with (so update is an UpdateRequest under the hood, delete is a DeleteRequest, and index is an IndexRequest). In the case of create, it is a specialization of index, which effectively says "add this if it doesn't exist, but fail if it does exist".
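For the "update if it exists, create if it doesn't" case, the update sub-request also accepts a doc_as_upsert flag, which inserts the partial document when no document with that _id exists. A minimal sketch of the two lines such a sub-request occupies in a bulk body, assuming the database primary key is used as _id (the index name and helper name are made up for illustration):

```python
import json


def upsert_action(index: str, doc_id: str, doc: dict) -> str:
    """Return the two NDJSON lines for one update-or-create sub-request."""
    # First line: the action and its metadata.
    action = {"update": {"_index": index, "_id": doc_id}}
    # Second line: the partial document; doc_as_upsert creates it if it is missing.
    payload = {"doc": doc, "doc_as_upsert": True}
    return json.dumps(action) + "\n" + json.dumps(payload) + "\n"


print(upsert_action("candidates", "42", {"email": "a@example.com"}), end="")
```

If replacing the whole document is acceptable, a plain index sub-request with the same _id gives the same "insert or overwrite" effect without merging fields.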
How to format a large amount of JSON to run through bulk to update/create the document because bulk api expects the body to be formatted a certain way.
You should look into using either the Logstash approach or any of the existing client language libraries, such as the Python client, which should work well from cron. The clients will take care of the formatting for you. One for your preferred language most likely already exists.
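For a sense of what the client libraries generate, the bulk body is newline-delimited JSON: one action line followed by one source line per document, with a trailing newline at the end. The following is a hand-rolled sketch rather than a replacement for a client helper; the function name, the id field name, and the chunk size are assumptions for illustration.

```python
import json
from typing import Iterable, Iterator


def bulk_bodies(rows: Iterable[dict], index: str,
                id_field: str = "id", chunk_size: int = 500) -> Iterator[str]:
    """Yield NDJSON bulk bodies, each holding at most chunk_size index actions."""
    lines = []
    count = 0
    for row in rows:
        # Action line: index (insert or replace) under the row's own ID.
        lines.append(json.dumps({"index": {"_index": index, "_id": str(row[id_field])}}))
        # Source line: the document itself.
        lines.append(json.dumps(row))
        count += 1
        if count == chunk_size:
            yield "\n".join(lines) + "\n"  # the bulk body must end with a newline
            lines, count = [], 0
    if lines:
        yield "\n".join(lines) + "\n"
```

Each yielded string would be POSTed to the _bulk endpoint with the Content-Type header set to application/x-ndjson.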