Implement created_on and updated_on logic on client side - elasticsearch

From elasticsearch > 2, there is no _timestamp field. we have to explicitly populate time fields like created_on and updated_on
One way i know to populate these fields is check item to be populated is already existing in Database using uid (assume uid generated on client side using some item properties). If item exists in Database, update all fields except created_on. If item does not exist, create entry in database with item and created_on equal to current time.
My questions are:
* Isn't checking every time i create/update redundant ??
* Is there any better way to implement created_on and updated_on logic on client side without redundant (without querying elasticsearch) ??

Using a "middleware" for this is a good way to avoid having this kind of logic in the client, once you change the design, you would need to perform changes on every client implementation, so I think is a good use case for ingesting pipelines and there is an example in the doc.
Accessing Ingest Metadata Fields:
Beyond metadata fields and source fields, ingest also adds ingest metadata to the documents that it processes. These metadata properties are accessible under the _ingest key. Currently ingest adds the ingest timestamp under the _ingest.timestamp key of the ingest metadata. The ingest timestamp is the time when Elasticsearch received the index or bulk request to pre-process the document.
If you need more intelligent middleware, mind the Script Processor which allows inline and stored scripts to be executed within ingest pipelines.

Related

Fetching Index data from elasticsearch DB using Spring Data ElasticSearch

I have a java code which connects to Elasticsearch DB using Spring-data-elasticsearch and fetches all the index data by connecting to the repository and executing the findAll() method. The data received from ES is being processed by a seperate application. When new data is inserted into elastic search, I have the below queries
1. How can I fetch only the newly inserted data Programatically ?
2. Apart from using the DSL queries, Is there a way to Asyncronously get the new records as and when new data is inserted into elasticsearch DB.
I dont want to execute the findAll() method again. Because it returns the entire data ( including the previously processed records as well) .
Any help on this is much appreciated.
You will need to add a field (I call it createdAt here) to your entities that contains the timestamp when your application inserts into Elasticsearch. One possibility would be to use the auditing support of Spring Data Elasticsearch to have the value set automatically, or you set the value in your application. If the data is inserted by some other application you need to make sure that it contains a timestamp in a format that maps the field type definition of this field in your application.
Then you'd need to define a method in your repository like
SearchHits<T> findByCreatedAtAfter(Timestamp referenceValue);
As for getting a notification in some form when new data is inserted: I'm not aware that Elasticsearch offers something like that. You will probably need to regularly call the method that retrieves the data.

How to add only new docs or changed docs on elasticsearch?

Scenario: Script pulls data from an external API, formats the results as a dictionary/json object, and pushes the data to elasticsearch. The script is scheduled to run periodically.
Conditions: The script should only push the dictionaries for records that do not already exist in elasticsearch. And for records that exist in elasticsearch, update fields if any data has been changed.
My Approach: The records from the API have an ID which I use to check if they exist in elasticsearch by doing a search query. I make a list of IDs that do not exist in elasticsearch and push the corresponding records to elasticsearch.
Issue: For example, if record with {'ID':1, 'Status':'Started'} was pushed to elasticsearch yesterday. Now the data has changed to {'ID':1, 'Status':'Completed'} it will still be ignored because I am checking only the ID.
Solution that I am thinking of: Insert into elasticsearch by comparing all the fields of the json object/dictionary. If everything matches, skip insertion. If any field has different value insert into elasticsearch [Redundancy of having multiple docs for the same record is not an issue. Redundancy of having multiple docs for the same record with all the same values needs to be avoided.]
You can pass the document ID to the index method. This will insert the record if it doesn't exist or it will update any fields that are different. This way you don't need to add custom logic to manage that ID as a regular field.

ElasticSearch: querying most recent snapshot design

I'm trying to decide how to structure the data in ElasticSearch.
I have a system that is producing metrics on a daily basis. I would like to put those metrics into ES so I could do some advances querying/sorting. I also only care about the most recent data that's in there. The system producing the data could also be late.
Currently I can think of two options:
I can have one index with a date column that contains the date that the metric was created. I am unsure, however, of how to write the query so that if multiple days worth of data are in the index I filter it to just the most recent set.
I could also try and split the data up into different indexes (recent and past) and have some sort of process that migrates data from the recent index to the past index. I think the challenge with this would be having downtime where the data is being moved and/or added into the recent.
Thoughts?
A common approach to solving this problem with elastic search would be to store data in a form that allows historic querying, then again in a second form that allows querying the most recent data. For example if your metric update looked like:
{
"type":"OperationsPerSecond",
"name":"Questions",
"value":10
}
Then it can be indexed into our current values index using a composite key constructed from the document (obviously, for this to work you'd need to be able to construct a composite key from your document!). For example, your identity for this document might be the type and name concatenated. You then leverage the upsert API to allow you to write your updates to the same document:
POST current_metrics/_update/OperationsPerSecond-Questions
{
"type":"OperationsPerSecond",
"name":"Questions",
"value":10
}
Every time you call this API with the same composite key it will update the existing document, rather than create a new document. This will give you an index that only contains a single record per metric you are monitoring, and you can query that index to get your most recent values.
To store your historic data, you change your primary key strategy, it would probably be most straightforward to use the index API and get elastic to generate a primary key for you.
POST all_metrics/_doc/
{
"type":"OperationsPerSecond",
"name":"Questions",
"value":10
}
This API will create a new document for every request made to it. So as long as you have something in your data that you can use in an elastic range query, such as a field like createdDate with a value that looks like a date time, then you should be able to query historic data.
The main thing is, don't worry about duplicating your data for different purposes, elastic does a good job of compressing this stuff on disk and in memory. Storing data multiple times is called denormalization and is a pretty common technique in data warehousing and big data.

Using elasticsearch generated ID's in kafka elasticsearch connector

I noticed that documents indexed in elasticsearch using the kafka elasticsearch connector have their ids in the following format topic+partition+offset.
I would prefer to use id's generated by elasticsearch. It seems topic+partition+offset is not usually unique so I am loosing data.
How can I change that?
As Phil says in the comments -- topic-partition-offset should be unique, so I don't see how this is causing data loss for you.
Regardless - you can either let the connector generate the key (as you are doing), or you can define the key yourself (key.ignore=false). There is no other option.
You can use Single Message Transformations with Kafka Connect to derive a key from the fields in your data. Based on your message in the Elasticsearch forum it looks like there is an id in your data - if that's going to be unique you could set that as your key, and thus as your Elasticsearch document ID too. Here's an example of defining a key with SMT:
# Add the `id` field as the key using Simple Message Transformations
transforms=InsertKey, ExtractId
# `ValueToKey`: push an object of one of the column fields (`id`) into the key
transforms.InsertKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.InsertKey.fields=id
# `ExtractField`: convert key from an object to a plain field
transforms.ExtractId.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.ExtractId.field=id
(via https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/)
#Robin Moffatt, as much as I see it, topic-partition-offset can cause duplicates in case that upgrade your kafka cluster, but not in rolling upgrade fashion but just replace cluster with cluster (which is sometime easier to replace). In this case you will experience data loss because of overwriting data.
Regarding to your excellent example, this can be the solution for many of the cases, but I'd add another option. Maybe you can add epoc timestamp element to the topic-partition-offset so this will be like this topic-partition-offset-current_timestamp.
What do you think?

Using ElasticSearch Bulk to update and create documents dynamically?

I'm currently using elasticsearch and running a cron job every 10 minutes that will find newly created/updated data from my DB and sync it with elasticsearch. However, I want to use bulk to sync instead of making and arbitrary amount of requests to update/create documents in an index. I'm using the elasticsearch.js library created by elasticsearch.
I face 2 challenges that I'm uncertain about how to handle:
How to use bulk to update a document if it exists and create a document if it doesn't within bulk without knowing if it exists in the index.
How to format a large amount of JSON to run through bulk to update/create the document because bulk api expects the body to be formatted a certain way.
The best option when trying to stream in data from an SQL database is to use Logstash's JDBC Input to do it for you (the documentation). This can hopefully just do it all for you.
Not all SQL schemes make this easy, so for your specific questions:
How to use bulk to update a document if it exists and create a document if it doesn't within bulk without knowing if it exists in the index.
Bulk currently accepts four different types of sub-requests, which behave differently than you probably expect coming from an SQL world:
index
create
update
delete
The first, index, is the most commonly used option. It means that you want to index (the verb) something to the Elasticsearch index (the noun). However, if it already exists in the index given the same _id, then it will replace it. The rest are probably a bit more obvious.
Each one of the sub-requests behaves like the individual option that they're associated with (so update is an UpdateRequest under the hood, delete is a DeleteRequest, and index is an IndexRequest). In the case of create, it is a specialization of index, which effectively says "add this if it doesn't exist, but fail it if is does exist".
How to format a large amount of JSON to run through bulk to update/create the document because bulk api expects the body to be formatted a certain way.
You should look into using either the Logstash approach or any of the existing client language libraries, such as the Python client, which should work well from cron. The clients will take care of the formatting for you. One for your preferred language most likely already exists.

Resources