Build pipeline from Oracle DB to AWS DynamoDB - oracle

I have an Oracle instance running on a stand alone EC2 VM, I want to do two things.
1) Copy the data from one of my Oracle tables into a cloud directory that can be read by DynamoDB. This will only be done once.
2) Then daily I want to append any changes to that source table into the DynamoDB table as another row that will share an id so I can visualize how that row is changing over time.
Ideally I'd like a solution that would be as easy as pipeing the results of a SQL query into a program that dumps that data into a cloud files system (S3, HDFS?), then I will want to convert that data into a format that can be read with DynamoDB.
So I need these things:
1) A transport device, I want to be able to type something like this on the command line:
sqlplus ... "SQL Query" | transport --output_path --output_type etc etc
2) For the path I need a cloud file system, S3 looks like the obvious choice since I want a turn key solution here.
3) This last part is a nice to have because I can always use a temp directory to hold my raw text and convert it in another step.

I assume the "cloud directory" or "cloud file system" you are referring to is S3? I don't see how it could be anything else in this context, but you are using very vague terms.
Triggering the DynamoDB insert to happen whenever you copy a new file to S3 is pretty simple, just have S3 trigger a Lambda function to process the data and insert into DynamoDB. I'm not clear on how you are going to get the data into S3 though. If you are just running a cron job to periodically query Oracle and dump some data to a file, which you then copy to S3, then that should work.
You need to know that you can't append to a file on S3, you would need to write the entire file each time you push new data to S3. If you are wanting to stream the data somehow then using Kenesis instead of S3 might be a better option.

Related

Incremental ETL using Glue

Need help in processing incremental files.
Scenario: Source team is creating file in every 1hr in s3 (hrly partitioned). I would like to process in every 4hr. The Glue etl will read the s3 files (partitioned hrly) and process to store in different s3 folders.
Note : Glue ETL is called from airflow.
Question How can I make sure that I only process the incremental files ( let’s say 4 files in each execution)?
Sounds like a use case for Bookmarks
For example, your ETL job might read new partitions in an Amazon S3
file. AWS Glue tracks which partitions the job has processed
successfully to prevent duplicate processing and duplicate data in the
job's target data store.

How to skip the target write in case zero record count in Informatica Cloud (IICS)?

I have one use case to create incremental Data ingestion pipeline from one Database to AWS S3. I have created a pipeline and it is working fine except for the one scenario where no incremental data was found.
In case of zero record count, it is writing the file with a header-only (parquet file). I want to skip the target write when there is no incremental record.
How I can implement this in IICS?
I have already tried to implement the router transformation where I have put the condition if record count > 0 then only write to target but still it is not working.
First of all: the target file gets created even before any data is read from source. This is to ensure the process has write access to target location. So even if there will be no data to store, an empty file will get created.
The possible ways out here will be to:
Have a command task check the number of lines in output file and delete it if there is just a header. This would require the file to be created locally, verified, and uploaded to S3 afterwards e.g. using Mass Ingestion task - all invoked sequentially via taskflow
Have a session that will first check if there is any data available, and only then run the data extraction.

EMR/Redshift Data Warehouse

We are currently using SQL Server in AWS. We are looking at ways to create a data warehouse from that data in SQL Server.
It seems like the easiest way was to use AWS DMS tool and send data to redshift having it constantly sync. Redshift is pretty expensive so looking at other ways of doing it.
I have been working with EMR. Currently I am using sqoop to take data from SQL Server and put it into Hive. I am currently use the HDFS volume to store data. I have not used S3 yet for that.
Our database has many tables with millions of rows in each.
What is the best way to update this data everyday? Does sqoop support updating data. If not what other tool is used for something like this.
Any help would be great.
My suggestion you can go for Hadoop clusters(EMR) if the processing is too complex and time taken process or better to use Redshift.
Choose the right tool. If it is for the data warehouse then go for Redshift.
And why DMS? are you going to sync in real-time? You want the daily sync. So no need to use DMS.
Better solution:
Make sure you have a primary key column and column that tell us when the row gets updated like updated_at or modified_at.
Run BCP to export the data in bulk from SQL Server to CSV files.
Upload the CSV to S3 then import to RedShift.
Use glue to fetch the incremental data (based on the primary key column and the update_at column) then export it to S3.
Import files from S3 to RedShift staging tables.
Run upsert command (update + insert) to merge the staging table with the main table.
If you feel running the glue is a bit expensive, then use SSIS or Powershell script to do steps 1 to 4. Then psql command to import files from S3 to Redshift and do steps 5 and 6.
This will handle the Insert and updates in your SQL server tables. But the deletes will not be a part of it. If you need all CRUD operations then go for the CDC method with DMS or Debezium. Then push it to S3 and RedShift.

What's the proper way to log big data to organize and store it with Hadoop, and query it using Hive?

So basically I have apps on different platforms that are sending logging data to my server. It's a node server that essentially accepts a payload of log entries and it saves them to their respective log files (as write stream buffers, so it is fast), and creates a new log file whenever one fills up.
The way I'm storing my logs is essentially one file per "endpoint", and each log file consists of space separated values that correspond to metrics. For example, a player event log structure might look like this:
timestamp user mediatype event
and the log entry would then look like this
1433421453 bob iPhone play
Based off of reading documentation, I think this format is good for something like Hadoop. The way I think this works, is I will store these logs on a server, then run a cron job that periodically moves these files to S3. From S3, I could use those logs as a source for a Hadoop cluster using Amazon's EMR. From there, I could query it with Hive.
Does this approach make sense? Are there flaws in my logic? How should I be saving/moving these files around for Amazon's EMR? Do I need to concatenate all my log files into one giant one?
Also, what if I add a metric to a log in the future? Will that mess up all my previous data?
I realize I have a lot of questions, that's because I'm new to Big Data and need a solution. Thank you very much for your time, I appreciate it.
If you have a large volume of log dump that changes periodically, the approach you laid out makes sense. Using EMRFS, you can directly process the logs from S3 (which you probably know).
As you 'append' new log events to Hive, the part files will be produced. So, you dont have to concatenate them ahead of loading them to Hive.
(on day 0, the logs are in some delimited form, loaded to Hive, Part files are produced as a result of various transformations. On subsequent cycles, new events/logs will be appened to those part files.)
Adding new fields on an ongoing basis is a challenge. You can create new data structures/sets and Hive tables and join them. But the joins are going to be slow. So, you may want to define fillers/placeholders in your schema.
If you are going to receive streams of logs (lots of small log files/events) and need to run near real time analytics, then have a look at Kinesis.
(also test drive Impala. It is faster)
.. my 2c.

Map Reduce - How to plan the data files

I would like to use AWS EMR to query large log files that I will write to S3. I can design the files any way I like. The data is created in a rate of 10K entries/minute.
The logs consist of dozens of data points and I'd like to collect data for very long period of time (years) to compare trends etc.
What are the best practices for creating such files that will be stored on S3 and queried by AWS EMR cluster?
Whats the optimal file sizes ?Should I create separate files for example on hourly basis?
What is the best way to name the files?
Should I place them in daily/hourly buckets or all in the same bucket?
Whats the best way to handle things like adding some data after a while or change in data structure that I use?
Should I compress things for example by leaving out domain names out of urls or keep as much data as possible?
Is there any concept like partitioning (the data is based on 100s of websites so I can use site ids for example). I must be able to query all the data together, or by partitions.
Thanks!
in my opinion you should use a hourly basis bucket to store data in s3 and then use a pipeline to schedule your mr job to clean the data.
once u have clean the data you can keep it to a location in s3 and then you can run a data pipeline on hourly basis on the lag of 1hour with respect to your MR pipeline to put this process data into redshift.
Hence at 3am on a day you will have 3 hour of processed data in s3 and 2 hour processed into redshift dB.
To do this you can have 1 machine dedicated for running pipelines and on that machine you can define you shell script/perl/python or so script to load data to your dB.
You can use AWS bucketing formatter for year,month,date,hour and so on. for e.g.
{format(minusHours(#scheduledStartTime,2),'YYYY')}/mm=#{format(minusHours(#scheduledStartTime,2),'MM')}/dd=#{format(minusHours(#scheduledStartTime,2),'dd')}/hh=#{format(minusHours(#scheduledStartTime,2),'HH')}/*

Resources