Reload Elasticsearch index data from DynamoDB using Kinesis Firehose and Lambda - elasticsearch

I'm using Lambda to send DynamoDB Streams to Kinesis Firehose and then to Elasticsearch.
I would like to know if it's possible to reload the Elasticsearch index data from DynamoDB automatically. I mean, if I delete the Elasticsearch index data, how do I send the old data again without using any Lambda script?
In the DynamoDB stream's view type I have selected "New and old images". I don't know if that's related, because sometimes the records I receive in Elasticsearch contain an attribute called OldImage or NewImage that holds the data coming from DynamoDB.
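A minimal sketch of why those attributes appear and of what a one-off reload could look like (the table and delivery stream names are placeholders, not from the question). Because the stream view type is "New and old images", every stream record carries the item under NewImage and/or OldImage. Note also that a DynamoDB stream only retains changes for 24 hours, so reloading a deleted index generally means scanning the table again rather than replaying the stream:

import json
import boto3

dynamodb = boto3.resource("dynamodb")
firehose = boto3.client("firehose")

def handler(event, context):
    # Stream records carry the data under NewImage/OldImage because the
    # stream view type is NEW_AND_OLD_IMAGES.
    for record in event["Records"]:
        image = record["dynamodb"].get("NewImage") or record["dynamodb"].get("OldImage")
        firehose.put_record(
            DeliveryStreamName="my-delivery-stream",  # placeholder
            Record={"Data": (json.dumps(image) + "\n").encode("utf-8")},
        )

def backfill(table_name="my-table"):  # placeholder table name
    # One-off reload: scan the whole table and push every item through the
    # same delivery stream that normally receives the stream events.
    table = dynamodb.Table(table_name)
    response = table.scan()
    items = response["Items"]
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])
    for item in items:
        firehose.put_record(
            DeliveryStreamName="my-delivery-stream",
            Record={"Data": (json.dumps(item, default=str) + "\n").encode("utf-8")},
        )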

Related

BigQuery to Elasticsearch (avoid adding duplicate documents to Elasticsearch)

I am trying to sync data between BigQuery and Elasticsearch using the job template provided in GCP. The issue is that BigQuery sends all the documents every time the job is run, and since Elasticsearch assigns the document _id itself, every run creates duplicate documents.
Is there a way to configure the _id field while sending data from BigQuery to Elasticsearch?
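This is not from the original question, but independent of the Dataflow template the underlying fix for duplicates is to index with a deterministic _id, so that re-running the job overwrites existing documents instead of appending new ones. A minimal sketch with the Python Elasticsearch client, assuming each BigQuery row has a unique key column (here called row_id, a placeholder):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def index_rows(rows):
    # Reusing the row's own key as _id makes indexing idempotent:
    # a second run updates the existing documents instead of duplicating them.
    actions = (
        {"_index": "my-index", "_id": row["row_id"], "_source": row}
        for row in rows
    )
    helpers.bulk(es, actions)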

How to transform CloudWatch log events into records inserted in Elasticsearch via Kinesis Data Firehose

I am trying to stream my CloudWatch logs to AWS OpenSearch (Elasticsearch) by creating a Kinesis Firehose subscription filter.
I want to understand whether Kinesis Data Firehose performs bulk inserts into Elasticsearch via the bulk API, or whether it inserts single records.
Basically, I have created a Lambda transformer which aggregates the CloudWatch log messages and transforms them into the format required by Elasticsearch's bulk insert API. I have tested the data format by calling the Elasticsearch bulk API directly and it works fine, but when I try to do the same via Firehose I get the following error:
message : One or more records are malformed. Please ensure that each record is single valid JSON object and that it does not contain newlines
error code : OS.MalformedData
The transformed data :
{"index":{"_index":"test","_type":"/aws/lambda/HelloWorldLogProducer","_id":"36495659535505340260631192776509739326813505134549729280"}}
{"timeMillis":1636521972493,"thread":"main","level":"INFO","loggerName":"helloworld.App","message":"Inflating logs to increase size","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{"AWSRequestId":"332de390-42b3-44dd-937a-cc4089ff9510"},"threadId":1,"threadPriority":5,"jobId":"${ctx:request_id}","clientUUID":"${ctx:client_uuid}","clientUUIDHeader":"${ctx:client_uuid_header}"}
{"index":{"_index":"test","_type":"/aws/lambda/HelloWorldLogProducer","_id":"36495659535884452929006213369915846537448527280151396358"}}
{"timeMillis":1636521973189,"thread":"main","level":"INFO","loggerName":"helloworld.App","message":"{ \"message\": \"hello world\", \"location\": \"3.137.161.28\" }","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{"AWSRequestId":"332de390-42b3-44dd-937a-cc4089ff9510"},"threadId":1,"threadPriority":5,"jobId":"${ctx:request_id}","clientUUID":"${ctx:client_uuid}","clientUUIDHeader":"${ctx:client_uuid_header}"}
It seems that Firehose does not support bulk inserts the way the CloudWatch-to-OpenSearch subscription filter does; however, I cannot define the index name in the OpenSearch subscription filter.
Has anyone faced a similar problem before?
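For what it's worth, the Firehose Elasticsearch/OpenSearch destination builds the _bulk request itself (the index name comes from the delivery stream configuration), so a transformation Lambda must return exactly one plain JSON document per record, with no action metadata line and no newlines. A rough sketch of such a transformer in Python; the way the log events are collapsed into one document is illustrative, not taken from the question:

import base64
import gzip
import json

def handler(event, context):
    output = []
    for record in event["records"]:
        # CloudWatch Logs subscriptions deliver gzipped batches of log events.
        payload = json.loads(gzip.decompress(base64.b64decode(record["data"])))
        if payload.get("messageType") != "DATA_MESSAGE":
            output.append({"recordId": record["recordId"], "result": "Dropped",
                           "data": record["data"]})
            continue
        # Emit one JSON document per Firehose record; Firehose adds the bulk
        # action line and the index name on its own.
        doc = {
            "logGroup": payload["logGroup"],
            "logStream": payload["logStream"],
            "messages": [e["message"] for e in payload["logEvents"]],
        }
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(doc).encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}

The transformation contract is strictly one output record per input record (matched by recordId), so a batch of log events cannot be fanned out into multiple Elasticsearch documents from inside the transformer.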

Fetching index data from Elasticsearch using Spring Data Elasticsearch

I have Java code which connects to Elasticsearch using Spring Data Elasticsearch and fetches all the index data by connecting to the repository and executing the findAll() method. The data received from ES is processed by a separate application. When new data is inserted into Elasticsearch, I have the following questions:
1. How can I fetch only the newly inserted data programmatically?
2. Apart from using DSL queries, is there a way to asynchronously get the new records as and when new data is inserted into Elasticsearch?
I don't want to execute the findAll() method again, because it returns the entire data set (including the previously processed records).
Any help on this is much appreciated.
You will need to add a field (I call it createdAt here) to your entities that contains the timestamp at which your application inserts the entity into Elasticsearch. One possibility is to use the auditing support of Spring Data Elasticsearch to have the value set automatically, or you can set the value in your application. If the data is inserted by some other application, you need to make sure that it contains a timestamp in a format that matches the field type definition of this field in your application.
Then you'd need to define a method in your repository like
SearchHits<T> findByCreatedAtAfter(Timestamp referenceValue);
As for getting a notification in some form when new data is inserted: I'm not aware that Elasticsearch offers something like that. You will probably need to regularly call the method that retrieves the data.

Best way to send data from DynamoDB to Amazon Elasticsearch

I was wondering what the best way is to send data from DynamoDB to Elasticsearch.
AWS SDK for JavaScript (Lambda): https://github.com/Stockflare/lambda-dynamo-to-elasticsearch/blob/master/index.js
DynamoDB Logstash plugin: https://github.com/awslabs/logstash-input-dynamodb
Follow this AWS blog post; it describes in detail how this should be done:
https://aws.amazon.com/blogs/compute/indexing-amazon-dynamodb-content-with-amazon-elasticsearch-service-using-aws-lambda/
Edit:
I'm assuming you use the AWS Elasticsearch managed service.
1. Use DynamoDB Streams to listen for changes (among other things, you'll get events for new items added to the table).
2. Create a new Kinesis Firehose delivery stream configured to deliver all records to your Elasticsearch domain.
3. Create a new Lambda that is triggered by the new-item events on the DynamoDB stream. The Lambda gets the unique DynamoDB record key, fetches the record payload, and ingests it into the Firehose delivery stream (see the Python sketch below).
Depending on your DynamoDB record size, you might enable the option to include the record's payload in the stream item, so you don't need to fetch it from the table and consume the provisioned read capacity you've set.
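A rough sketch of such a Lambda in Python, assuming the stream view type includes the new image and using a placeholder delivery stream name:

import json
import boto3

firehose = boto3.client("firehose")

def handler(event, context):
    records = []
    for record in event["Records"]:
        if record["eventName"] == "REMOVE":
            continue  # only forward inserts and updates
        # NewImage is present when the view type is NEW_IMAGE or NEW_AND_OLD_IMAGES.
        new_image = record["dynamodb"]["NewImage"]
        records.append({"Data": (json.dumps(new_image) + "\n").encode("utf-8")})
    if records:
        # PutRecordBatch accepts at most 500 records per call; chunk the list
        # if your stream batch size is configured larger than that.
        firehose.put_record_batch(
            DeliveryStreamName="my-delivery-stream",  # placeholder
            Records=records,
        )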
I recommend enabling a DynamoDB stream with a Lambda trigger on your table, then taking that data in the Lambda and writing it into Elasticsearch.

Does ElasticSearch store a duplicate copy of each record?

I started looking into ElasticSearch, and most examples of creating and reading involve POSTing data to the ElasticSearch server and then doing a GET to retrieve them.
Is this data that is POSTed stored separately by the ElasticSearch server? So, if I want to use ElasticSearch with MongoDB, does the raw data, not including the search indices, get stored twice (one copy for MongoDB and one for ElasticSearch)?
In conjunction with an answer to this question, a description or a link to a description of how ElasticSearch and the primary data store interact would be very helpful.
Yes, ElasticSearch can only search within its own data store, so a separate copy will be there.
You can use mongo-connector to keep the data in Elasticsearch in sync with the MongoDB database: https://github.com/mongodb-labs/mongo-connector
