Manipulating Data Within AWS Redshift to a Schedule - etl

Current Setup:
SQL Server OLTP database
AWS Redshift OLAP database updated from the OLTP database via SSIS every 20 minutes
Our customers only have access to the OLAP DB
Requirement:
One customer requires some additional tables to be created and populated on a schedule, which can be done by aggregating data already in AWS Redshift.
Challenge:
This is only for one customer, so I cannot leverage the core process for populating AWS Redshift; the process must be independent and is to be handed over to the customer, who does not use SSIS and doesn't wish to start. I was considering using AWS Data Pipeline, but it is not yet available in the region in which the customer resides.
Question:
What is my alternative? I am aware of numerous partners who offer ETL-like solutions, but this seems over the top; ultimately all I want to do is execute a series of SQL statements on a schedule with some form of error handling/alerting. The preference of both the customer and management is to avoid a bespoke app for this, hence the intended use of Data Pipeline.

To export data from AWS Redshift to another data source using Data Pipeline, you can follow a template similar to https://github.com/awslabs/data-pipeline-samples/tree/master/samples/RedshiftToRDS, which transfers data from Redshift to RDS. Instead of using RDSDatabase as the sink, you could add a JdbcDatabase (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-jdbcdatabase.html). The template https://github.com/awslabs/data-pipeline-samples/blob/master/samples/oracle-backup/definition.json shows in more detail how to use the JdbcDatabase object.
There are many such templates available in https://github.com/awslabs/data-pipeline-samples/tree/master/samples to use as a reference.
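If it helps, here is a rough boto3 sketch of creating a pipeline and pushing a definition that declares a JdbcDatabase sink. The connection string, driver class, and credentials are placeholders, and a working pipeline would also need the Default, Schedule, Ec2Resource, RedshiftDatabase, and copy-activity objects from the sample templates above.

```python
# Rough sketch: register a Data Pipeline definition containing a JdbcDatabase
# sink object. All connection details below are placeholders; a runnable
# pipeline needs the remaining objects from the sample templates.
import boto3

dp = boto3.client("datapipeline")

created = dp.create_pipeline(name="redshift-to-jdbc", uniqueId="redshift-to-jdbc-001")
pipeline_id = created["pipelineId"]

jdbc_sink = {
    "id": "JdbcSinkDatabase",
    "name": "JdbcSinkDatabase",
    "fields": [
        {"key": "type", "stringValue": "JdbcDatabase"},
        {"key": "connectionString", "stringValue": "jdbc:postgresql://target-host:5432/targetdb"},
        {"key": "jdbcDriverClass", "stringValue": "org.postgresql.Driver"},
        {"key": "username", "stringValue": "etl_user"},
        {"key": "*password", "stringValue": "REPLACE_ME"},
    ],
}

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=[jdbc_sink])
# dp.activate_pipeline(pipelineId=pipeline_id)  # activate once the definition is complete
```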

I do exactly the same thing as you, but I use the Lambda service to perform my ETL. One drawback of Lambda is that it can only run for a maximum of 5 minutes (initially it was 1 minute).
So for ETL jobs longer than 5 minutes, I am planning to set up a PHP server in AWS from which I can run my SQL queries, scheduled at any time with a cron job.
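For reference, a minimal sketch of the Lambda approach in Python, assuming psycopg2 is bundled with the function (or provided via a layer), the schedule is a CloudWatch Events/EventBridge rule, and connection details come from environment variables; the schema, table, and SQL below are placeholders. A failed invocation can then drive a CloudWatch alarm or SNS notification for the error-handling/alerting requirement.

```python
# Minimal sketch: a scheduled Lambda that runs aggregation SQL against Redshift.
# Assumes psycopg2 is packaged with the deployment (or a layer); the schema,
# table, and statements are placeholders.
import os
import psycopg2

AGGREGATION_SQL = [
    "TRUNCATE TABLE customer_agg.daily_totals;",
    """INSERT INTO customer_agg.daily_totals
       SELECT order_date, SUM(amount)
       FROM sales
       GROUP BY order_date;""",
]

def lambda_handler(event, context):
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=int(os.environ.get("REDSHIFT_PORT", "5439")),
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    try:
        with conn.cursor() as cur:
            for statement in AGGREGATION_SQL:
                cur.execute(statement)
        conn.commit()
    except Exception:
        conn.rollback()
        raise  # a failed invocation can feed a CloudWatch alarm / SNS alert
    finally:
        conn.close()
    return {"status": "ok"}
```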

Related

ADF Copy Data remove index in oracle Sink

I am trying to insert data from a SQL table into an Oracle table using the Copy Data activity in Data Factory. On the first try it runs fine, but on the second try it throws an error that an index on the target table (Oracle) has been corrupted.
Searching different forums, I found that the Copy Data activity apparently sends the insert statement in the following way: INSERT /*+ SYS_DL_CURSOR */ INTO
Any idea how to fix this?
Thank you very much for the help.
As per the error, the index is not corrupted; it was used twice. Maybe the operation did not run according to its schedule and two copies worked in parallel.
The Copy activity is executed on an integration runtime. You can use different types of integration runtimes for different data copy scenarios:
When you're copying data between two data stores that are publicly accessible through the internet from any IP, you can use the Azure integration runtime for the copy activity. This integration runtime is secure, reliable, scalable, and globally available.
When you're copying data to and from data stores that are located on-premises or in a network with access control (for example, an Azure virtual network), you need to set up a self-hosted integration runtime.
Use whichever of the two integration runtimes described above fits your scenario, and the error should be resolved.
See the support document: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview

Database Migration from Oracle RAC to AWS Amazon Aurora

I am working on a task to make a data migration plan to migrate Oracle RAC to AWS Amazon Aurora.
The current in-house production database is a 10 TB, 8-node Oracle RAC cluster with a single-node standby at a DR site. The database has 2 main schemas comprising 500 tables, 300 packages and triggers, and 20 partitioned tables; it supports 5,000 concurrent sessions of which 100 are active at a given time, and has an IOPS requirement of 50K read and 30K write. The development database is 1/10th of the production capacity.
I did some research and found that DMS (Database Migration Service) and SCT (Schema Conversion Tool) take care of the migration process. So do we need to work on any of the individual specifications mentioned in the task, or will DMS and SCT take care of the whole migration?
The tools you mention (DMS and SCT) are powerful and useful, but will they take care of the whole migration process? Very unlikely unless you have a very simple data model.
There will likely be some objects and code that cannot be converted automatically and will need manual input/development from you. Migrating a database is usually not a simple thing and even with tools like SCT and DMS you need to be prepared to plan, review and test.
SCT can produce an assessment report for you. I would start here. Your question is next to impossible to answer on a forum like this without intricate knowledge of the system you are migrating.

How to launch SQL user defined procedure exclusively, without transaction, using external launcher.exe?

I am migrating an older application from an on-premises "good old" SQL Server to Azure SQL. So far, so good.
The old solution used a SQL Server Agent job to launch usp_data_pump, which pulls some data from a 3rd-party database. The first run (with my database empty) takes about 30 minutes. Because of added optimizations, subsequent runs take about 5 seconds when the watched data has not changed in the other database. A run can take more time, but because of how the data is created, it will still be a matter of seconds. However, in some situations my database content can be "reset" (a user action), and then a run can again take those 30 minutes or so.
I need to pump the data every 5 minutes to pick up the minor changes.
As Azure SQL does not have SQL Server Agent, I decided to use a nearby Azure Windows Server and its standard Task Scheduler to execute launcher.exe every 5 minutes; it just connects to the Azure SQL server, executes the usp_data_pump stored procedure, and stops. However, when the scheduler triggers it, it runs "forever".
I am not sure what is happening. My first thought was that launcher.exe was launched again after 5 minutes when the previous run had not yet finished its task. However, in the Settings tab of the scheduled task, the option is set to...
Do not start a new instance
Firstly, how do I implement periodic, exclusive execution of the usp_data_pump procedure? A transaction must not be used inside it.
Azure SQL has two options for automated jobs: SQL Agent for Azure SQL Managed Instance, and Elastic Jobs for Azure SQL Database. You can use Elastic Jobs to execute your stored procedure.
See this document, Automate management tasks using elastic jobs:
You can create and schedule elastic jobs that could be periodically executed against one or many Azure SQL databases to run Transact-SQL (T-SQL) queries and perform maintenance tasks.
The Elastic Job agent is free. The job database is billed at the same rate as any database in Azure SQL Database.
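For illustration, a rough pyodbc sketch of setting this up from Python, assuming the Elastic Job agent and its job database are already provisioned and that a database-scoped credential named JobRunCredential exists for the agent; the server, database, and job names are placeholders, and the jobs.* procedures are run against the job database, not the target database.

```python
# Rough sketch, assuming an Elastic Job agent and job database already exist and
# a database-scoped credential (here called JobRunCredential) is configured.
# All names are placeholders; execute against the *job database*.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myjobserver.database.windows.net;DATABASE=jobdb;"
    "UID=job_admin;PWD=REPLACE_ME",
    autocommit=True,
)
cur = conn.cursor()

# Target group pointing at the database that owns usp_data_pump
cur.execute("EXEC jobs.sp_add_target_group @target_group_name = N'PumpTargets'")
cur.execute("""
    EXEC jobs.sp_add_target_group_member
         @target_group_name = N'PumpTargets',
         @target_type = N'SqlDatabase',
         @server_name = N'myappserver.database.windows.net',
         @database_name = N'appdb'
""")

# Job scheduled every 5 minutes; the step is a plain EXEC of the procedure
cur.execute("""
    EXEC jobs.sp_add_job
         @job_name = N'DataPumpEvery5Min',
         @enabled = 1,
         @schedule_interval_type = N'Minutes',
         @schedule_interval_count = 5
""")
cur.execute("""
    EXEC jobs.sp_add_jobstep
         @job_name = N'DataPumpEvery5Min',
         @command = N'EXEC dbo.usp_data_pump',
         @credential_name = N'JobRunCredential',
         @target_group_name = N'PumpTargets'
""")
conn.close()
```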
HTH.

Manually logging database event in datastage job

I have a parallel job that writes to an Oracle table. I want to manually write warnings to DataStage's log if some event occurs. For example, if a certain value is inserted for a certain column, I want to record that information in the log. Could this be achieved somehow?
To write custom messages into the logs for a particular job's data stream, you can use a combination of a Copy stage, a Transformer, and a Peek stage. The Peek stage is the one that writes to the logs. I like to set the Peek stage to run in sequential mode, so that your messages are kept together in single log entries instead of being spread across nodes.
Also, you can peek the rejects of the Oracle stage; maybe combine this with the above option (using a Funnel stage and a standard column schema).
Lastly, if you'd actually like to query the logs themselves and write them out somewhere else or use them in a job (amongst all the other data kept about jobs in the repository), you can directly query the DSODB schema in the XMETA database, i.e. the DataStage repository (DB2 by default).
You would need the DataStage Operations Console up and running for that (I'm not sure what version of DataStage you're running). If DataStage is running on a single tier and uses the default DB2 database, you can simply catalog the DSODB database so that it's available as a connection in the DB2 connector. Otherwise you'd need to install a DB2 client on the DataStage engine tier and catalog the database there.
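As a hedged example, something like the following reads recent run history from DSODB with the ibm_db Python driver. The connection details are placeholders, and the JOBRUN/JOBEXEC table and column names should be verified against the Operations Console schema of your DataStage version before relying on them.

```python
# Hedged sketch: query DataStage run history from the Operations Console
# repository (DSODB) via ibm_db. Hostname, credentials, and the JOBRUN/JOBEXEC
# table and column names are assumptions to verify against your installation.
import ibm_db

conn = ibm_db.connect(
    "DATABASE=XMETA;HOSTNAME=ds-engine.example.com;PORT=50000;"
    "PROTOCOL=TCPIP;UID=dsodb_user;PWD=REPLACE_ME;",
    "", "",
)

sql = """
    SELECT e.PROJECTNAME, e.JOBNAME, r.RUNSTARTTIMESTAMP, r.RUNMAJORSTATUS
    FROM DSODB.JOBRUN r
    JOIN DSODB.JOBEXEC e ON e.JOBID = r.JOBID
    ORDER BY r.RUNSTARTTIMESTAMP DESC
    FETCH FIRST 20 ROWS ONLY
"""
stmt = ibm_db.exec_immediate(conn, sql)
row = ibm_db.fetch_assoc(stmt)
while row:
    print(row)  # one dict per run: project, job, start time, status
    row = ibm_db.fetch_assoc(stmt)
ibm_db.close(conn)
```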
All the best!

What would be the best way to upload data with Azure Data Factory to Azure Database

I'm looking for the best way to upload files to an Azure SQL Database.
We have to use Azure Data Factory, as at this moment we are not allowed to use Azure VMs with SSIS.
Each day we upload 1.5 GB of XML files.
Currently we upload them to Blob storage, and with a Copy activity we load them into the DB.
But this takes up to 2.5 hours.
What would be a better/faster approach to do this?
Any suggestions?
You can use the bcp utility to import your data into an instance of SQL Server, as explained in this document. From there, you can use Azure SQL Data Sync to synchronize your Azure SQL Database with your SQL Server database. This may provide a faster execution time.
In the end it dropped to 35 minutes,
just by using several storage accounts (5) and splitting the data over the 5 accounts,
with 5 ADF pipelines uploading everything into the same staging table.
Some were huge files, but we had over 100,000 small files from 2 KB up to 100 KB.
This worked out fine for us.
We noticed that the DTUs never went up to their limit, so we figured the DB was not the bottleneck; by first splitting into 2 accounts we saw DTU usage rise a bit more, and we continued on that path.
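For anyone wanting to reproduce the split, here is a rough Python sketch using the azure-storage-blob SDK that round-robins local files across several storage accounts; the connection strings, container name, and local folder are placeholders, and the split rule can be anything deterministic.

```python
# Rough sketch: distribute local XML files across several storage accounts
# (round-robin) so multiple ADF pipelines can load them in parallel.
# Connection strings, container name, and folder are placeholders.
import os
from itertools import cycle
from azure.storage.blob import BlobServiceClient

CONNECTION_STRINGS = [
    os.environ["STORAGE_CONN_1"],
    os.environ["STORAGE_CONN_2"],
    os.environ["STORAGE_CONN_3"],
    os.environ["STORAGE_CONN_4"],
    os.environ["STORAGE_CONN_5"],
]
CONTAINER = "xml-staging"  # assumed to exist in every account
SOURCE_DIR = "xml_export"

containers = cycle(
    BlobServiceClient.from_connection_string(cs).get_container_client(CONTAINER)
    for cs in CONNECTION_STRINGS
)

for name in os.listdir(SOURCE_DIR):
    container = next(containers)
    with open(os.path.join(SOURCE_DIR, name), "rb") as data:
        container.upload_blob(name=name, data=data, overwrite=True)
```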
