Copy on-prem data to S3 using AWS Data Pipeline - amazon-data-pipeline

How do I import data from an on-prem SQL database to Amazon S3 using AWS Data Pipeline?
Links to any tutorials would help me greatly.

You will need to run a Task Runner on-premises. That Task Runner can execute a pipeline that exports data from your SQL database to S3.
If your database is MySQL, Data Pipeline can export the data directly, as shown in the tutorial below.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-copydata-mysql.html
If it is not MySQL, you need to write the export job yourself, for example by using Talend DI to create an ETL script and wrapping the Talend job in a ShellCommandActivity (a minimal sketch of such a pipeline follows below).
There is also an on-prem Task Runner template:
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-localmysqltords.html
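For illustration only, here is a minimal sketch of registering such a pipeline with boto3, with a ShellCommandActivity bound to the worker group that the on-prem Task Runner polls. The worker-group name, IAM roles, schedule, export command, and S3 path are hypothetical placeholders, not values from the tutorials above.

    import boto3

    dp = boto3.client('datapipeline', region_name='us-east-1')

    # Create an empty pipeline; uniqueId acts as an idempotency token.
    pipeline_id = dp.create_pipeline(
        name='onprem-sql-to-s3',
        uniqueId='onprem-sql-to-s3-v1',
    )['pipelineId']

    dp.put_pipeline_definition(
        pipelineId=pipeline_id,
        pipelineObjects=[
            {   # Default object: run everything on the daily schedule below.
                'id': 'Default', 'name': 'Default',
                'fields': [
                    {'key': 'scheduleType', 'stringValue': 'cron'},
                    {'key': 'schedule', 'refValue': 'DailySchedule'},
                    {'key': 'role', 'stringValue': 'DataPipelineDefaultRole'},
                    {'key': 'resourceRole', 'stringValue': 'DataPipelineDefaultResourceRole'},
                ],
            },
            {   # One run per day.
                'id': 'DailySchedule', 'name': 'DailySchedule',
                'fields': [
                    {'key': 'type', 'stringValue': 'Schedule'},
                    {'key': 'period', 'stringValue': '1 days'},
                    {'key': 'startDateTime', 'stringValue': '2024-01-01T00:00:00'},
                ],
            },
            {   # The on-prem Task Runner polls this worker group and runs the command.
                'id': 'ExportActivity', 'name': 'ExportActivity',
                'fields': [
                    {'key': 'type', 'stringValue': 'ShellCommandActivity'},
                    {'key': 'workerGroup', 'stringValue': 'onprem-worker-group'},
                    # export.sh is a hypothetical wrapper around your Talend/ETL export job.
                    {'key': 'command', 'stringValue': '/opt/etl/export.sh && aws s3 cp /tmp/export.csv s3://my-bucket/exports/'},
                ],
            },
        ],
    )

    dp.activate_pipeline(pipelineId=pipeline_id)

The Task Runner on the on-prem host would then be started with the matching --workerGroup=onprem-worker-group option so it picks up the activity.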

Related

Incremental export (daily changes) from an Amazon RDS Oracle table to Amazon S3

I'm new to AWS and am exploring solutions for my use case, as defined below.
Objective: to export the incremental changes (i.e. each day's changes, as a nightly batch) from Oracle RDS to Amazon S3 on an automatic schedule.
As far as I understand it, the options are:
AWS Data Pipeline
A custom pipeline, i.e. CloudWatch -> Lambda -> invoking EC2 -> invoking a stored procedure to extract the data -> using Python to upload the results to S3 (a minimal sketch of this last step follows below)
Are there any other options that address this case?
Could you also guide me on the most cost-effective approach?
Thanks & Regards,
Damo
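For reference, the final step of the custom-pipeline option above (Python uploading the extracted results to S3) could look roughly like this sketch; the bucket name, key layout, and local file path are hypothetical.

    import datetime

    import boto3

    s3 = boto3.client('s3')

    def upload_nightly_extract(local_path):
        """Upload the day's incremental extract under a date-partitioned prefix."""
        today = datetime.date.today().isoformat()
        key = 'oracle-incremental/dt={}/extract.csv'.format(today)
        s3.upload_file(local_path, 'my-export-bucket', key)

    upload_nightly_extract('/tmp/oracle_incremental.csv')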

Trigger AWS Lake Formation Workflow programmatically

In the documentation I did not find any mention of programmatically triggering workflows (e.g. to load data from an RDBMS into S3). I also did not find anything in the API specification of AWS Lake Formation. We are using Airflow to schedule our ETL pipeline, and it would be nice if we could trigger the workflow via an API call.
Lake Formation is used to specify permissions for tables and columns in the Glue Catalog; it does not manage ETL. You can use Glue to set up crawlers that download data from an RDBMS, and a crawler can be part of a Glue Workflow (see the sketch after the reference links below).
Reference Links:
Documentation on adding a crawler for RDBMS
Glue API to start a crawler
Glue API to start a workflow
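For example, an Airflow task (e.g. inside a PythonOperator) could call the Glue API directly with boto3. This is only a sketch: the crawler and workflow names are hypothetical.

    import boto3

    glue = boto3.client('glue', region_name='eu-west-1')

    # Start just the crawler that pulls tables from the RDBMS connection...
    glue.start_crawler(Name='rdbms-to-s3-crawler')

    # ...or start a whole Glue workflow that contains the crawler and ETL jobs.
    run_id = glue.start_workflow_run(Name='rdbms-ingest-workflow')['RunId']
    print('Started workflow run', run_id)

start_workflow_run returns a RunId that the scheduler can poll with get_workflow_run if it needs to wait for completion.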

Migrate data from Snowflake to Elasticsearch

We are using the Snowflake data warehouse in my project, and we would like to replace Snowflake with Elasticsearch as part of a project-enhancement POC.
I haven't found any solutions for moving data from Snowflake to Elasticsearch, or any clues on how to approach the migration.
Can anyone help me resolve this? Please share any relevant information, steps, etc.
Thanks in advance.
You can try to do it in two steps:
export the data from Snowflake to an AWS S3 bucket
load the data from the AWS S3 bucket into Elasticsearch.
You will need to implement the migration at the schema level. Also, if you spell out the specific issues you are hitting, it will be easier to answer and guide you.
You can use the COPY command to export data from Snowflake to files that can then be loaded into another system. However, I am curious why you are trying to replace Snowflake with Elasticsearch, as these are two different technologies serving very different functions.
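As a sketch of that COPY-based export, run from snowflake-connector-python. The table name, S3 location, and credentials are hypothetical placeholders; MAX_FILE_SIZE keeps the unload split into several smaller files, which suits the per-file Lambda pattern described in the next answer.

    import snowflake.connector

    conn = snowflake.connector.connect(
        account='my_account',
        user='my_user',
        password='...',          # use a secrets manager in practice
        warehouse='MY_WH',
        database='MY_DB',
        schema='PUBLIC',
    )

    unload_sql = """
    COPY INTO 's3://my-export-bucket/snowflake/orders/'
    FROM (SELECT OBJECT_CONSTRUCT(*) FROM orders)      -- one JSON document per row
    CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
    FILE_FORMAT = (TYPE = JSON)
    MAX_FILE_SIZE = 16000000                           -- ~16 MB files
    """

    cur = conn.cursor()
    cur.execute(unload_sql)
    cur.close()
    conn.close()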
You can export your data from Snowflake to S3 with the COPY command.
Export in multiple parts so your S3 bucket ends up with small files.
Then you can hook a Lambda to the S3 PUT Object event, so a Lambda is triggered on each file upload.
In that Lambda you can write code that makes REST calls to Elasticsearch (see the sketch below).
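A minimal sketch of such a Lambda, assuming the exported files are newline-delimited JSON. The cluster URL, index name, and bucket layout are hypothetical, and authentication is omitted.

    import json
    import urllib.request

    import boto3

    s3 = boto3.client('s3')
    ES_BULK_URL = 'https://my-es-cluster.example.com:9200/_bulk'
    INDEX = 'snowflake_export'

    def handler(event, context):
        for record in event['Records']:
            bucket = record['s3']['bucket']['name']
            key = record['s3']['object']['key']
            body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')

            # Build a bulk request: an action line followed by each document line.
            lines = []
            for doc in body.splitlines():
                if not doc.strip():
                    continue
                lines.append(json.dumps({'index': {'_index': INDEX}}))
                lines.append(doc)
            payload = ('\n'.join(lines) + '\n').encode('utf-8')

            req = urllib.request.Request(
                ES_BULK_URL,
                data=payload,
                headers={'Content-Type': 'application/x-ndjson'},
                method='POST',
            )
            with urllib.request.urlopen(req) as resp:
                print('{}: bulk indexed, status {}'.format(key, resp.status))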

Loading data automatically from Oracle DB to Google BigQuery

Good day,
I have an Oracle DB and I need to load some of its tables so I can query them in BigQuery.
Is there a way of loading the data automatically, every 24 h, into Google BigQuery?
Any approach would work: it could be loading into Cloud Storage and creating the tables from there, or loading into Google Drive from the server.
I really need some ideas; I have read a lot of articles with no luck.
Check this tutorial by Progress:
https://www.progress.com/tutorials/cloud-and-hybrid/etl-on-premises-oracle-data-to-google-bigquery-using-google-cloud-dataflow
In this tutorial, the main goal is to connect to an on-premises Oracle database, read the data, apply a simple transformation, and write it to BigQuery. The code for the project has been uploaded to GitHub for reference.
This solution uses Dataflow and Progress' Hybrid Data Pipeline tool:
Google Cloud Dataflow is a data processing service for both batch and real-time data streams. Dataflow lets you build pipelines to ingest data, then transform and process it according to your needs before making that data available to analysis tools. DataDirect Hybrid Data Pipeline can be used to ingest both on-premises and cloud data with Google Cloud Dataflow.
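If the full Dataflow setup is more than you need, a simpler scheduled alternative (a sketch only, not the tutorial's code; connection details, table, and dataset names are hypothetical) is a daily cron or Cloud Scheduler job that reads the Oracle table with cx_Oracle and loads the rows into BigQuery:

    import cx_Oracle
    from google.cloud import bigquery

    def sync_table():
        # 1. Pull the rows from the on-prem Oracle table.
        conn = cx_Oracle.connect('scott', 'tiger', 'onprem-host:1521/ORCLPDB1')
        cur = conn.cursor()
        cur.execute('SELECT id, name, amount FROM sales')
        cols = [d[0].lower() for d in cur.description]
        rows = [dict(zip(cols, r)) for r in cur.fetchall()]
        cur.close()
        conn.close()

        # 2. Load them into BigQuery, replacing the previous day's copy of the table.
        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
            autodetect=True,
        )
        job = client.load_table_from_json(rows, 'my_project.my_dataset.sales',
                                          job_config=job_config)
        job.result()   # wait for the load job to finish

    if __name__ == '__main__':
        sync_table()

The Dataflow approach in the tutorial scales better for large tables; this sketch is only meant to show the moving parts of a 24-hour load.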

Different tools available for creating data pipelines

I need to create data pipelines in Hadoop. I have data import, data export, and data-cleaning scripts set up, and now I need to tie them together in a pipeline.
I have been using Oozie for the data import and export schedules, but now I need to integrate R scripts into the pipeline for the data-cleaning step as well.
I see Falcon is used for the same purpose.
How do I install Falcon on Cloudera?
What other tools are available for creating data pipelines in Hadoop?
On the second question: I'm tempted to answer NiFi from Hortonworks; since this post on LinkedIn it has grown a lot and is very close to replacing Oozie. As I write this answer, the main difference between Oozie and NiFi is where they run: NiFi runs on its own cluster outside Hadoop, while Oozie runs inside the Hadoop cluster.
