Best way to send data from DynamoDB to Amazon Elasticsearch

I was wondering which is the best way to send data from DynamoDB to Elasticsearch.
AWS SDK for JavaScript (in a Lambda function): https://github.com/Stockflare/lambda-dynamo-to-elasticsearch/blob/master/index.js
DynamoDB Logstash plugin: https://github.com/awslabs/logstash-input-dynamodb

Follow this AWS blog post; it describes in detail how this is and should be done:
https://aws.amazon.com/blogs/compute/indexing-amazon-dynamodb-content-with-amazon-elasticsearch-service-using-aws-lambda/
I'm assuming you use the AWS-managed Elasticsearch service.
Use DynamoDB Streams to listen for changes (among other things, you'll receive events for new items added to DynamoDB).
Create a new Kinesis Data Firehose delivery stream configured to deliver all records to your Elasticsearch domain.
Create a new Lambda function that is triggered by the new-item events on the DynamoDB stream.
The Lambda receives the key of the changed DynamoDB item, so it can fetch the full record payload and put it onto the Firehose delivery stream (see the sketch below).
Depending on your DynamoDB item size, you can enable the stream view type that includes the item's payload in the stream record, so you don't need to fetch it from the table and consume the provisioned capacity you've set.
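A minimal sketch of such a Lambda in Python with boto3, assuming the NEW_IMAGE stream view type (so the payload is already in the event) and a hypothetical delivery stream name dynamo-to-es:

```python
import json
import boto3

firehose = boto3.client("firehose")

# Hypothetical delivery stream name; replace with your own.
DELIVERY_STREAM = "dynamo-to-es"

def handler(event, context):
    """Forward new/updated DynamoDB stream images to a Firehose delivery stream."""
    records = []
    for record in event.get("Records", []):
        # With the NEW_IMAGE view type the full item payload is already in the event.
        new_image = record.get("dynamodb", {}).get("NewImage")
        if not new_image:
            continue
        records.append({"Data": (json.dumps(new_image) + "\n").encode("utf-8")})
    if records:
        # PutRecordBatch accepts up to 500 records per call.
        firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=records)
    return {"forwarded": len(records)}
```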

I recommend enabling a DynamoDB Stream on your table, triggering an AWS Lambda function from that stream, and having the Lambda write the data into Elasticsearch.
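A minimal sketch of that approach, assuming the requests and requests-aws4auth packages are bundled with the function, and treating the domain endpoint, index name, and key attribute ("id") as placeholders:

```python
import os
import boto3
import requests
from requests_aws4auth import AWS4Auth

# Hypothetical values; replace with your domain endpoint and index.
ES_ENDPOINT = os.environ.get("ES_ENDPOINT", "https://my-domain.us-east-1.es.amazonaws.com")
INDEX = "events"

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(
    credentials.access_key,
    credentials.secret_key,
    os.environ.get("AWS_REGION", "us-east-1"),
    "es",
    session_token=credentials.token,
)

def handler(event, context):
    """Index new/updated DynamoDB stream images directly into Elasticsearch."""
    for record in event.get("Records", []):
        if record["eventName"] == "REMOVE":
            continue
        item = record["dynamodb"]["NewImage"]
        doc_id = record["dynamodb"]["Keys"]["id"]["S"]  # assumes a string key named "id"
        url = f"{ES_ENDPOINT}/{INDEX}/_doc/{doc_id}"
        requests.put(url, auth=awsauth, json=item, timeout=10)
```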

Related

For the Kafka sink connector, can I send a single message to documents in multiple Elasticsearch indices?

I am receiving a very complex JSON payload inside a topic message, so I want to do some computations on it using SMTs and send the results to documents in different Elasticsearch indices. Is that possible?
I am not able to find a solution for this.
The Elasticsearch sink connector only writes to one index per record, based on the topic name. The Confluent documentation explicitly states that topic-altering transforms such as RegexRouter will not work as expected.
I'd suggest looking at the Logstash Kafka input and Elasticsearch output plugins as an alternative; however, I'm still not sure how you'd "split" a record into multiple documents there either.
You may need an intermediate Kafka consumer, such as Kafka Streams or ksqlDB, to extract your nested JSON and emit the multiple records you expect in Elasticsearch.
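For illustration, a minimal sketch of such an intermediate consumer in Python with confluent-kafka; the broker address, topic names (complex-events in, split-events out), and the nested field being split ("items") are all hypothetical:

```python
import json
from confluent_kafka import Consumer, Producer

# Hypothetical broker address and topic names; adjust for your cluster.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "json-splitter",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["complex-events"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        payload = json.loads(msg.value())
        # Emit one output record per nested element; a downstream ES sink connector
        # (or Logstash) can then index each record as its own document.
        for element in payload.get("items", []):
            producer.produce("split-events", value=json.dumps(element).encode("utf-8"))
        producer.poll(0)
finally:
    consumer.close()
    producer.flush()
```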

How to transform CloudWatch log events into insert records for Elasticsearch via Kinesis Data Firehose

I am trying to stream my CloudWatch logs to AWS OpenSearch (Elasticsearch) by creating a Kinesis Firehose subscription filter.
I want to understand whether Kinesis Data Firehose supports bulk inserts into Elasticsearch via the bulk API, or whether it inserts single records.
Basically, I have created a Lambda transformer which aggregates the CloudWatch log messages and transforms them into the format required by the Elasticsearch bulk insert API. I have tested the data format by calling the Elasticsearch bulk API directly and it works fine; however, when I try to do the same via Firehose I get the following error:
message: One or more records are malformed. Please ensure that each record is single valid JSON object and that it does not contain newlines
error code: OS.MalformedData
The transformed data:
{"index":{"_index":"test","_type":"/aws/lambda/HelloWorldLogProducer","_id":"36495659535505340260631192776509739326813505134549729280"}}
{"timeMillis":1636521972493,"thread":"main","level":"INFO","loggerName":"helloworld.App","message":"Inflating logs to increase size","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{"AWSRequestId":"332de390-42b3-44dd-937a-cc4089ff9510"},"threadId":1,"threadPriority":5,"jobId":"${ctx:request_id}","clientUUID":"${ctx:client_uuid}","clientUUIDHeader":"${ctx:client_uuid_header}"}
{"index":{"_index":"test","_type":"/aws/lambda/HelloWorldLogProducer","_id":"36495659535884452929006213369915846537448527280151396358"}}
{"timeMillis":1636521973189,"thread":"main","level":"INFO","loggerName":"helloworld.App","message":"{ \"message\": \"hello world\", \"location\": \"3.137.161.28\" }","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{"AWSRequestId":"332de390-42b3-44dd-937a-cc4089ff9510"},"threadId":1,"threadPriority":5,"jobId":"${ctx:request_id}","clientUUID":"${ctx:client_uuid}","clientUUIDHeader":"${ctx:client_uuid_header}"}
It seems that Firehose does not support bulk inserts the way the CloudWatch-to-OpenSearch subscription filter does; however, I cannot define the index name in the OpenSearch subscription filter.
Has anyone faced a similar problem before?
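For reference, a Firehose data-transformation Lambda is generally expected to return one single-line JSON object per record; the OpenSearch/Elasticsearch destination builds the bulk request itself, so the transformer should not emit bulk-API action/metadata lines. A rough sketch under that assumption (field names are illustrative):

```python
import base64
import gzip
import json

def handler(event, context):
    """Sketch of a Firehose data-transformation Lambda for a CloudWatch Logs source."""
    output = []
    for record in event["records"]:
        # CloudWatch Logs subscription data arrives gzip-compressed and base64-encoded.
        payload = json.loads(gzip.decompress(base64.b64decode(record["data"])))
        if payload.get("messageType") != "DATA_MESSAGE":
            output.append({"recordId": record["recordId"],
                           "result": "Dropped",
                           "data": record["data"]})
            continue
        # Return one JSON object per Firehose record, with no newlines and no bulk
        # action lines. To index one document per log event you would have to split
        # events into separate Firehose records (e.g. by re-ingesting them).
        doc = {
            "logGroup": payload["logGroup"],
            "logStream": payload["logStream"],
            "events": [{"timestamp": e["timestamp"], "message": e["message"]}
                       for e in payload["logEvents"]],
        }
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                json.dumps(doc, separators=(",", ":")).encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```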

Topic mapping when streaming from Kafka to Elasticsearch

When I transfer or stream two or three tables, I can easily map them in Elasticsearch, but can topics be mapped to indices automatically?
I have streamed data from PostgreSQL to ES by mapping topics manually: topic.index.map=topic1:index1,topic2:index2, etc.
Can whatever topics the producer sends be mapped automatically when the ES sink connector consumes them?
By default, the topics map directly to an index of the same name.
If you want "better" control, you can use RegexRouter in a transforms property.
To quote the docs:
topic.index.map
This option is now deprecated. A future version may remove it completely. Please use single message transforms, such as RegexRouter, to map topic names to index names
If you cannot capture every topic with a single regex in one connector, then run more connectors, each with a different pattern.
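For illustration, a sketch of registering an Elasticsearch sink connector with a RegexRouter transform via the Kafka Connect REST API; the Connect endpoint, topic pattern, and replacement are placeholders for your own setup:

```python
import json
import requests

# Hypothetical Connect REST endpoint; adjust for your cluster.
CONNECT_URL = "http://localhost:8083/connectors"

connector = {
    "name": "es-sink-postgres-tables",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics.regex": "postgres\\..*",              # consume every topic matching the pattern
        "connection.url": "http://elasticsearch:9200",
        # RegexRouter rewrites the topic name, and therefore the target index,
        # e.g. "postgres.public.users" -> "users".
        "transforms": "renameIndex",
        "transforms.renameIndex.type": "org.apache.kafka.connect.transforms.RegexRouter",
        "transforms.renameIndex.regex": "postgres\\.public\\.(.*)",
        "transforms.renameIndex.replacement": "$1",
    },
}

resp = requests.post(CONNECT_URL, json=connector, timeout=10)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```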

Reload Elasticsearch index data from DynamoDB using Kinesis Firehose and Lambda

I'm using Lambda to send DynamoDB Streams to Kinesis Firehose and then to Elasticsearch.
I would like to know whether it's possible to reload the Elasticsearch index data from DynamoDB automatically. I mean, if I delete the Elasticsearch index data, how can I send the old data again without using any Lambda script?
For the DynamoDB stream view type I have selected "New and old images". I don't know if that's related, because sometimes the Elasticsearch records contain an attribute called OldImage or NewImage that holds the data coming from DynamoDB.

NoSQL (Mongo, DynamoDB) with Elasticsearch vs. Elasticsearch alone

Recently I started to use DynamoDB to store events with a structure like this:
{start_date: '2016-04-01 15:00:00', end_date: '2016-04-01 15:30:00', from_id: 320, to_id: 360, type: 'yourtype', duration: 1800}
But when I started to analyze the data I ran into the fact that DynamoDB has no aggregations, and has read/write limits, response size limits, etc. Then I installed a plugin to index the data into ES. As a result, I see that I do not need to use DynamoDB anymore.
So my question is: when do you definitely need a NoSQL (in my case DynamoDB) instance alongside Elasticsearch?
Will it degrade ES performance when you store not only indexes but full documents there? (Yes, I know ES is just an index, but in some cases such an approach could be more cost effective than running a MySQL cluster.)
The reason you would write data to DynamoDB and then have it automatically indexed in Elasticsearch using DynamoDB Streams is that DynamoDB, or MySQL for that matter, is considered a reliable data store. Elasticsearch is an index and, generally speaking, isn't considered an appropriate place to store data that you really can't afford to lose.
DynamoDB by itself has issues with storing time-series event data, and aggregating is impossible, as you have stated. However, you can use DynamoDB Streams in conjunction with AWS Lambda and a separate DynamoDB table to materialize views for aggregations, depending on what you are trying to compute. Depending on your use case and required flexibility, this may be something to consider.
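A rough sketch of such a materialized-view Lambda, assuming a hypothetical aggregate table keyed by day and type and the event structure from the question:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical table holding pre-aggregated daily totals per event type.
AGGREGATE_TABLE = "event_daily_aggregates"

def handler(event, context):
    """Maintain per-day, per-type duration totals from a DynamoDB stream."""
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        item = record["dynamodb"]["NewImage"]
        day = item["start_date"]["S"][:10]      # e.g. "2016-04-01"
        event_type = item["type"]["S"]
        duration = item["duration"]["N"]
        # Atomically add this event's duration and count into the day/type bucket.
        dynamodb.update_item(
            TableName=AGGREGATE_TABLE,
            Key={"day": {"S": day}, "type": {"S": event_type}},
            UpdateExpression="ADD total_duration :d, event_count :one",
            ExpressionAttributeValues={":d": {"N": duration}, ":one": {"N": "1"}},
        )
```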
Using Elasticsearch as the only destination for things such as logs is generally considered acceptable if you are willing to accept the possibility of data loss. If the records you want to store and analyze are really too valuable to lose, you should store them somewhere else and have Elasticsearch be the copy that you query. Elasticsearch allows for very flexible aggregations, so it is an excellent tool for this type of use case.
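As an example of the kind of aggregation Elasticsearch makes easy on the event structure above, a sketch of a query summing duration per type per day; the endpoint and index name are placeholders, authentication is omitted, and on older Elasticsearch versions the histogram parameter is interval rather than calendar_interval:

```python
import requests

# Hypothetical endpoint and index name; authentication omitted for brevity.
ES_ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"

query = {
    "size": 0,
    "aggs": {
        "by_type": {
            "terms": {"field": "type.keyword"},
            "aggs": {
                "per_day": {
                    "date_histogram": {"field": "start_date", "calendar_interval": "day"},
                    "aggs": {"total_duration": {"sum": {"field": "duration"}}},
                }
            },
        }
    },
}

resp = requests.post(f"{ES_ENDPOINT}/events/_search", json=query, timeout=10)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["by_type"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```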
As a total alternative, you can use AWS Kinesis Firehose to ingest the events and persistently store them in S3. You can then use an S3 event to trigger an AWS Lambda function that sends the data to Elasticsearch, where you can aggregate it. This is an affordable solution, with the only major downside being the 60-second delay that Firehose imposes. With this approach, if you lose data in your Elasticsearch cluster it is still possible to reload it from the files stored in S3.
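A minimal sketch of such an S3-triggered Lambda, assuming the delivered objects contain newline-delimited JSON and omitting request signing for brevity; the endpoint and index name are placeholders:

```python
import json
import boto3
import requests

s3 = boto3.client("s3")

# Hypothetical endpoint and index; request signing (e.g. requests-aws4auth) omitted.
ES_ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"
INDEX = "events"

def handler(event, context):
    """Bulk-index the contents of newly written S3 objects into Elasticsearch."""
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        key = rec["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        # Firehose concatenates records as delivered; one JSON object per line is assumed.
        bulk = ""
        for line in body.splitlines():
            if line.strip():
                bulk += json.dumps({"index": {"_index": INDEX}}) + "\n" + line + "\n"
        resp = requests.post(f"{ES_ENDPOINT}/_bulk", data=bulk,
                             headers={"Content-Type": "application/x-ndjson"}, timeout=30)
        resp.raise_for_status()
```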
