I set up Elasticsearch Service and a DynamoDB stream as described in this blog post.
Now I need to add pre-existing data from DynamoDB to Elasticsearch.
I saw the "Indexing pre-existing content" part of the article, but I don't know what to do with that Python code or where to execute it.
What is the best option in this case to add the pre-existing data?
Populating existing items into Elasticsearch is not straightforward, since DynamoDB streams capture item changes, not existing records.
Here are a few approaches, with pros and cons.
Scan all the existing items from DynamoDB and send them to Elasticsearch
We can scan all the existing items and run a Python script hosted on an EC2 machine to send the data to ES (a minimal sketch follows the pros and cons below).
Pros:
a. Simple solution, nothing much required.
Cons:
a. Cannot be run in a Lambda function, since the job may time out if the number of records is too large.
b. This approach is more of a one-time thing and cannot be used for incremental changes (say we want to keep updating ES as the DynamoDB data changes).
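For reference, a minimal sketch of that one-off scan-and-index script. The region, table name, index name and ES endpoint are placeholders, and authentication against Amazon ES (e.g. signed requests) is left out:

    import json
    import boto3
    from elasticsearch import Elasticsearch, helpers

    dynamodb = boto3.resource("dynamodb", region_name="us-east-1")   # placeholder region
    table = dynamodb.Table("my-table")                               # placeholder table name
    es = Elasticsearch(["http://localhost:9200"])                    # placeholder endpoint

    def scan_all(table):
        # Follow DynamoDB's scan pagination (each page is at most 1 MB).
        kwargs = {}
        while True:
            page = table.scan(**kwargs)
            yield from page["Items"]
            if "LastEvaluatedKey" not in page:
                break
            kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

    actions = (
        {
            "_index": "my-index",
            "_id": str(item["id"]),                                  # assumes a key attribute "id"
            # DynamoDB returns numbers as Decimal; crude conversion so they serialize cleanly.
            "_source": json.loads(json.dumps(item, default=str)),
        }
        for item in scan_all(table)
    )
    helpers.bulk(es, actions)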
Use DynamoDB streams
We can enable DynamoDB streams and build the pipeline as explained here.
Now we can update some flag on the existing items so that all the records flow through the pipeline and the data goes to ES (see the sketch after the pros and cons below).
Pros:
a. The pipeline can be used for incremental dynamodb changes.
b. No code duplication or one-time effort. Every time we need to update an item in ES, we update the item in DynamoDB and it gets indexed in ES.
c. No redundant, untested, one-time code. (Maintaining such code is a huge issue in the software world.)
Cons:
a. Changing prod data can be dangerous and may not be allowed, depending on the use case.
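If you go this route, the "touch" step could be as simple as the following sketch, which scans the table and writes a harmless flag attribute to every item so that each one appears on the stream. The table name, key attribute and flag name are placeholders:

    import time
    import boto3

    table = boto3.resource("dynamodb", region_name="us-east-1").Table("my-table")  # placeholders

    kwargs = {}
    while True:
        page = table.scan(**kwargs)
        for item in page["Items"]:
            table.update_item(
                Key={"id": item["id"]},                        # assumes a simple hash key named "id"
                UpdateExpression="SET reindexed_at = :ts",     # harmless flag; the write hits the stream
                ExpressionAttributeValues={":ts": int(time.time())},
            )
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]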
This is a slight modification of the above approach.
Instead of changing items in the prod table, we can create a temporary table, enable a stream on the temporary table, and utilize the pipeline mentioned in the 2nd approach. Then copy items from the prod table to the temporary table; the data will flow through the existing pipeline and get indexed in ES (a copy sketch follows the pros and cons below).
Pros:
a. No Prod data change is required and this pipeline can be used for incremental changes as well.
b. Same as approach 2.
Cons:
a. Copying data from one table to another may take a lot of time, depending on the data size.
b. Copying data from one table to another is a one-time script and hence has maintainability issues.
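A rough sketch of that copy step, assuming placeholder table names and that the temporary table already has its stream and pipeline wired up:

    import boto3

    dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
    source = dynamodb.Table("prod-table")    # placeholder
    target = dynamodb.Table("temp-table")    # placeholder; stream enabled, pipeline attached

    kwargs = {}
    with target.batch_writer() as writer:    # batches writes and retries unprocessed items
        while True:
            page = source.scan(**kwargs)
            for item in page["Items"]:
                writer.put_item(Item=item)   # each put shows up on the temp table's stream
            if "LastEvaluatedKey" not in page:
                break
            kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]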
Feel free to edit or suggest other approaches in the comments.
Related
Currently, there's a denormalized table inside a MySQL database that contains hundreds of columns and millions of records.
The original source of the data does not have any way to track the changes so the entire table is dropped and rebuilt every day by a CRON job.
Now, I would like to import this data into Elasticsearch. What is the best way to approach this? Should I use Logstash to connect directly to the table and import it, or is there a better way?
Exporting the data into JSON or similar is an expensive process since we're talking about gigabytes of data every time.
Also, should I drop the index in elastic as well or is there a way to make it recognize the changes?
In any case - I'd recommend using index templates to simplify index creation.
Now for the ingestion strategy, I see two possible options:
Rework your ETL process to do a merge instead of dropping and recreating the entire table. This would definitely be slower but would allow shipping only deltas to ES or any other data source.
As you've imagined yourself - you should probably be fine with Logstash using daily jobs. Create a daily index and drop the old one during the daily migration.
You could introduce buffers, such as Kafka, to your infrastructure, but I feel that might be overkill for your current use case.
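To make the template-plus-daily-index idea concrete, here is a rough sketch using plain HTTP calls against Elasticsearch (7.8+ composable templates). The host, index prefix, mapping and alias are placeholders, and the alias is an extra convenience I'm adding rather than something required by the approach; Logstash would load into the newly created index:

    import datetime
    import requests

    ES = "http://localhost:9200"                                   # placeholder host
    today = datetime.date.today()
    new_index = f"myapp-{today:%Y.%m.%d}"
    old_index = f"myapp-{today - datetime.timedelta(days=1):%Y.%m.%d}"

    # Index template so every daily index picks up the same settings and mappings.
    requests.put(f"{ES}/_index_template/myapp", json={
        "index_patterns": ["myapp-*"],
        "template": {"mappings": {"properties": {"created_at": {"type": "date"}}}},
    })

    requests.put(f"{ES}/{new_index}")               # create today's index; Logstash loads into it
    requests.post(f"{ES}/_aliases", json={          # repoint the query alias in one atomic step
        "actions": [
            {"remove": {"index": old_index, "alias": "myapp"}},
            {"add": {"index": new_index, "alias": "myapp"}},
        ]
    })
    requests.delete(f"{ES}/{old_index}")            # drop yesterday's index after the swap

Querying through the alias means the application never needs to know which dated index is the current one.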
I need to aggregate data coming from DynamoDB to AWS Redshift, and I need it to be accurate and in sync. For the ETL I'm planning to use DynamoDB Streams, a Lambda transform, and Kinesis Firehose to, finally, Redshift.
What would the process be for updated data? I find it's all fine-tuned just for ETL. What would be the best option to keep both (DynamoDB and Redshift) in sync?
These are my current options:
Trigger an "UPDATE" command direct from Lambda to Redshift (blocking).
Aggregate all update/delete records and process them on an hourly basis "somehow".
Any experience with this? Maybe Redshift is not the best solution? I need to extract aggregated data for reporting / dashboarding on 2 TB of data.
The Redshift COPY command supports using a DynamoDB table as a data source. This may or may not be a possible solution in your case, as there are some limitations to this process. Data type and table naming differences can trip you up. Also, this isn't a great option for incremental updates, but it can be done if the amount of data is small and you can design the updating SQL. (A minimal COPY sketch follows below.)
Another route is to look at DynamoDB Streams. This will route data updates through Kinesis, which can be used to update Redshift at a reasonable rate. This can help keep the data synced between these databases, and will likely make the data available to Redshift as quickly as possible.
Remember that you are not going to get Redshift to match on a moment-by-moment basis. Is this what you mean by "in sync"? These are very different databases with very different use cases and architectures to support those use cases. Redshift works in big chunks of data that change more slowly than what typically happens in DynamoDB. Redshift will be updated in "chunks" at a more infrequent rate than DynamoDB. I've made systems that bring this down to 5-minute intervals, but 10-15 minute update intervals are where most end up when trying to keep a warehouse in sync.
The other option is to update Redshift infrequently (hourly?) and use federated queries to combine "recent" data with "older" data stored in Redshift. This is a more complicated solution and will likely mean changes to your data model to support it, but it is doable. So only go here if you really need to query very recent data right alongside older and bigger data.
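As a reference, a DynamoDB-sourced COPY run from Python might look roughly like the sketch below; the cluster endpoint, credentials, table names, IAM role and READRATIO value are all placeholders:

    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",   # placeholder endpoint
        port=5439, dbname="dev", user="admin", password="secret",
    )
    with conn, conn.cursor() as cur:
        # COPY pulls directly from the DynamoDB table; column names must match attribute names.
        cur.execute("""
            COPY public.my_table
            FROM 'dynamodb://MyDynamoTable'
            IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
            READRATIO 50;
        """)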
The best-suited answer is to use a Staging table with an UPSERT operation (or a Redshift interpretation of it).
I found this approach valid for my use case, where I needed to:
Keep Redshift as up to date as possible without causing blocking.
Be able to work with complex DynamoDB schemas that can't be used as a source directly, so the data has to be transformed to fit the Redshift DDL.
This is the architecture:
So we constantly load from Kinesis using the same COPY mechanism, but instead of loading directly into the final table, we use a staging one. Once the batch is loaded into staging, we look for duplicates between the two tables. Those duplicates in the final table are DELETED before an INSERT is performed.
After trying this, I've found that all the DELETE operations for the same batch perform better if enclosed within a single transaction. Also, a VACUUM operation is needed in order to re-balance after the new load.
For further detail on the UPSERT operation, I've found this source very useful.
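A minimal sketch of that staging upsert, using psycopg2 against Redshift with placeholder endpoint, table and key names; the DELETE and INSERT run in one transaction, and TRUNCATE/VACUUM run afterwards outside it:

    import psycopg2

    conn = psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
                            port=5439, dbname="dev", user="admin", password="secret")

    with conn, conn.cursor() as cur:   # one transaction for the whole batch
        # Remove rows from the final table that are about to be replaced by the staged batch.
        cur.execute("""
            DELETE FROM final_table
            USING staging_table
            WHERE final_table.id = staging_table.id;
        """)
        cur.execute("INSERT INTO final_table SELECT * FROM staging_table;")

    conn.autocommit = True             # TRUNCATE commits implicitly and VACUUM can't run in a transaction
    with conn.cursor() as cur:
        cur.execute("TRUNCATE staging_table;")   # clear staging for the next batch
        cur.execute("VACUUM final_table;")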
I'm currently trying to find out the best architecture approach for my use case:
I have S3 buckets (two totally separate ones) which contain data stored in JSON format. Data is partitioned by year/month/day prefixes, and inside a particular day I can find, e.g., hundreds of files for that date
(example: s3://mybucket/2018/12/31/file1,
s3://mybucket/2018/12/31/file2, s3://mybucket/2018/12/31/file..n)
Unfortunately, inside a particular prefix for a single day, across those tens or hundreds of files, the JSONs are not ordered by exact timestamp - so if we follow this example:
s3://mybucket/2018/12/31/
I can find:
file1 - which contains JSON about object "A" with timestamp "2018-12-31 18:00"
file100 - which contains JSON about object "A" with timestamp "2018-12-31 04:00"
What's even worse... I have the same scenario with my second bucket.
What I want to do with this data?
Gather my events from both buckets, grouped by the object's "ID" and sorted by timestamp, to visualize them in a timeline as the last step (which tools, and how, is out of scope).
My doubts are more about how to do it:
In a cost-efficient way
Cloud native (in AWS)
With smallest possible maintenance
What I was thinking of:
Not sure if... but loading every new file which arrives on S3 into DynamoDB (using a Lambda trigger). AFAIK, creating the table in the proper way - with my ID as the hash key and the timestamp as the range key - should work for me, correct?
Every new row inserted would be partitioned by its particular ID and already ordered in the correct manner - but I'm not an expert.
Use Logstash to load data from S3 into Elasticsearch - again, AFAIK everything in ES can be indexed, so also sorted. Timelion will probably allow me to do the fancy analysis I need. But again... not sure if ES will perform as I want... the price... the volume is big, etc.
??? No other ideas
To help you understand my need and show a bit of the data structure, I prepared this: :)
example of workflow
Volume of data?
Around +- 200 000 events - each event is a JSON with 4 features (ID, Event_type, Timestamp, Price)
To summarize:
I need to put the data somewhere efficiently, minimizing cost and keeping it sorted, so that the front end in the next step can present how events change over time - filtered by a particular "ID".
Thanks, and I appreciate any good advice, best practices, or solutions I can rely on! :)
@John Rotenstein - you are right, I absolutely forgot to add those details. Basically, I don't need any SQL functionality, as the data will not be updated. The only scenario is that a new event for a particular ID will arrive, so only new incremental data. Based on that, the only operation I will perform on this dataset is "Select". That's why I would prefer speed and an instant answer. People will look at this mostly per "ID" - so using filtering. Data arrives on S3 every 15 minutes (new files).
@Athar Khan - thanks for the good suggestion!
As far as I understand this, I would choose the second option of Elasticsearch, with Logstash loading the data from S3, and Kibana as the tool to investigate, search, sort and visualise.
Having Lambda push data from S3 to DynamoDB would probably work, but it might be less efficient and cost more, as you are running a compute process on each event while pushing to DynamoDB in small/single-item batches. Logstash, on the other hand, would read the files one by one and process them all. It also depends on how often you plan to load fresh data to S3, but both solutions should fit.
The fact that the timestamps are not ordered within the files is not an issue for Elasticsearch; you can index them in any order, and you will still be able to visualise and search them in Kibana in time-sorted order.
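As a small illustration, an index whose mapping types the timestamp as a date is all that is needed for time-sorted views, regardless of ingest order. The host is a placeholder, and the index and field names below simply mirror the four features mentioned in the question:

    import requests

    ES = "http://localhost:9200"     # placeholder host
    requests.put(f"{ES}/events", json={
        "mappings": {
            "properties": {
                "ID":         {"type": "keyword"},   # filter per object ID
                "Event_type": {"type": "keyword"},
                "Timestamp":  {"type": "date"},      # lets Kibana/Timelion sort and bucket by time
                "Price":      {"type": "double"},
            }
        }
    })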
My team's been thrown into the deep end and has been asked to build a federated search of customers over a variety of large datasets which hold varying degrees of differing data about each individual (and no matching identifiers), and I was wondering how to go about implementing it.
I was thinking Apache NiFi would be a good fit to query our various databases, merge the results, deduplicate the entries via an external tool, and then push the result into a database, which is then queried to populate an Elasticsearch instance for the application's use.
So roughly speaking something like this:-
For example's sake, the following data then exists in the result database after the first flow:-

Then https://github.com/dedupeio/dedupe is run over this database table, which will add cluster IDs to aid the record linkage, e.g.:-

A second flow would then query the result database and feed the result into the Elasticsearch instance for use by the application's API, which would use the cluster ID to link the duplicates.
Couple questions:-
How would I trigger dedupe to run once the merged content has been pushed to the database?
The corollary question - how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
I also haven't considered any CDC process here, as the databases will be getting constantly updated, which I'd need to handle. So I'm really interested if anybody has solved a similar problem or used a different approach (happy to consider other technologies too).
Thanks!
For de-duplicating...
You will probably need to write a custom processor, or use ExecuteScript. Since it looks like a Python library, I'm guessing you'd write a script for ExecuteScript, unless there is a Java library.
For triggering the second flow...
Do you need that intermediate DB table for something else?
If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.
If you don't need it, then you can just go MergeContent -> Dedupe -> ElasticSearch.
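If you do keep the intermediate table, the follow-on step that pushes the clustered rows into Elasticsearch could also be done outside NiFi with a small script. A rough sketch, assuming a PostgreSQL result table and hypothetical column names (id, name, dob, cluster_id):

    import psycopg2
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(["http://localhost:9200"])                          # placeholder
    conn = psycopg2.connect(dbname="results", user="app",
                            password="secret", host="localhost")           # placeholder result DB

    with conn.cursor() as cur:
        cur.execute("SELECT id, name, dob, cluster_id FROM deduped_customers;")
        actions = (
            {
                "_index": "customers",
                "_id": row[0],
                # cluster_id travels with the document so the API can group duplicates.
                "_source": {"name": row[1], "dob": row[2], "cluster_id": row[3]},
            }
            for row in cur
        )
        helpers.bulk(es, actions)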
Here is what we came up with, using a 3-value status column:
0 = Not indexed
1 = Updated
2 = Indexed
There will be 2 jobs...
Job 1 will select the top X records where status = 0 and pop them into a queue like RabbitMQ.
Then a consumer will bulk insert those records into ES and update the status of the DB records to 1.
For updates, since we have control of our data... the SQL stored proc that updates a particular record will set its status to 2. Job 2 will select the top X records where status = 2 and pop them onto RabbitMQ. Then a consumer will bulk insert those records into ES and update the status of the DB records to 1.
Of course, we may need an intermediate status for "queued" so that none of the jobs pick up the same record again, though the same job should not run if it hasn't completed. The chances of a queued record being updated are slim to none, since updates only happen at the end of the day, usually the next day. (A rough sketch of Job 1 follows below.)
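A rough sketch of what Job 1 could look like, assuming SQL Server via pyodbc and RabbitMQ via pika; the connection strings, queue name, table and column names are placeholders:

    import json
    import pika
    import pyodbc

    db = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                        "SERVER=myserver;DATABASE=mydb;UID=app;PWD=secret")   # placeholder
    mq = pika.BlockingConnection(pika.ConnectionParameters("localhost"))      # placeholder
    channel = mq.channel()
    channel.queue_declare(queue="es-index", durable=True)

    cur = db.cursor()
    # Grab a batch of not-yet-indexed rows (status = 0) and publish them for the consumer.
    cur.execute("SELECT TOP 500 Id, Name, Price FROM dbo.Products WHERE Status = 0;")
    for row in cur.fetchall():
        channel.basic_publish(
            exchange="",
            routing_key="es-index",
            body=json.dumps({"Id": row.Id, "Name": row.Name, "Price": float(row.Price)}),
        )
    mq.close()

    # The consumer (not shown) reads off "es-index", calls elasticsearch.helpers.bulk() with the
    # batch, then flips those rows to indexed: UPDATE dbo.Products SET Status = 1 WHERE Id IN (...).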
So I know there are rivers (but they're being deprecated and probably not as flexible as ETL).
I would like to bulk insert records from my SQL server to Elasticsearch.
Write a scheduled batch job of some sort, either ETL or any other tool - it doesn't matter.
select from table where id > lastIdInsertedToElasticSearch - this will allow loading the latest records into Elasticsearch at a scheduled interval.
But what if a record is updated in the SQL server? What would be a good pattern to track updated records in the SQL server and then push the updated records to ES? I know ES has document versions when putting the same ID, but I can't seem to visualize a pattern.
So IMHO, batch inserts are good for building or re-building the index. So for the first time, you can run batch jobs that run SQL queries and perform bulk updates. Rivers, as you correctly pointed out, don't provide a lot of flexibility in terms of transformation.
If the entries in your SQL data store are created by you (i.e. some codebase in your control), it would be better if the same codebase updated the documents in Elasticsearch - maybe not directly, but by notifying some other service or with the help of queues, so as not to waste time while responding to requests (if that's the kind of setup you have).
We have a pretty similar use case of Elasticsearch. We provide search inside our app, which performs search across different categories of data. Some of this data is actually created by the users of our app through our app - so we handle this easily. Our app writes that data to our SQL data store and pushes the same data in RabbitMQ for indexing/updating in Elasticsearch. On the other side of RabbitMQ, we have a consumer written in Python that basically replaces the entire document in Elasticsearch. So the corresponding rows in our SQL datastore and documents in Elasticsearch share the ID which enables us to update the document.
Another case is where a few of the data types we search on come from some 3rd-party service which exposes the data over its HTTP API. The data creation is in our control, but we don't have an automated mechanism for updating the entries in Elasticsearch. In this case, we basically run a cron job that takes care of this. We have managed to tune the cron's schedule because we also have a limited API query quota. But in this case, our data is not really updated that much per day, so this kind of system works for us.
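For the first case (pushing through RabbitMQ and replacing the whole document by its shared ID), the consumer side could look roughly like this sketch. The queue name, index name and hosts are placeholders, and the document= keyword assumes a recent elasticsearch-py client (older clients use body=):

    import json
    import pika
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])        # placeholder

    def handle(ch, method, properties, body):
        record = json.loads(body)
        # Indexing with an explicit id replaces any existing document with that id.
        es.index(index="app-search", id=record["id"], document=record)
        ch.basic_ack(delivery_tag=method.delivery_tag)

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))   # placeholder
    channel = conn.channel()
    channel.queue_declare(queue="search-updates", durable=True)
    channel.basic_consume(queue="search-updates", on_message_callback=handle)
    channel.start_consuming()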
Disclaimer: I co-developed this solution.
I needed something like the jdbc-river that could do more complex "roll-ups" of data. After careful consideration of what it would take to modify the jdbc-river to suit my needs, I ended up writing the river-net.
Here are a few of the features:
It gets fairly decent performance (comparable to the jdbc-river; we get upwards of 6k rows/sec).
It can join many tables to create complex nested arrays of documents without creating duplicate child documents
It follows a lot of the same conventions as the jdbc-river.
It also supports reading from files.
It's written in C#
It uses Quartz.Net and supports cron expressions for scheduling.
This project is open source, and we already have a second project (also to be open sourced) that does generic job scheduling with RabbitMQ. We have ported over a lot of this project, and plan to use the RabbitMQ river for better performance and stability when indexing into Elasticsearch.
To combat large updates, we aren't hitting tables directly. Instead we use stored procedures that only grab deltas. We also have an option on the sp to reset the delta to reindex everything.
The project is fairly young with only a few commits, but we are open to collaboration and new ideas.