Manually logging database events in a DataStage job

I have a parallel job that writes to an Oracle table. I want to manually write warnings to DataStage's log if some event occurs. For example, if a certain value is inserted for a certain column, I want to track this information in the log. Can this be achieved somehow?

To write custom messages into the log for a particular job's data stream, you can use a combination of a Copy stage, a Transformer, and a Peek stage. The Peek stage is the one that writes to the log. I like to set the Peek stage to run in sequential mode, so that your messages are kept together in single log entries instead of being spread across nodes.
Also, you can peek the rejects of the Oracle stage, and maybe combine this with the above option (using a Funnel stage and a standard column schema).
Lastly, if you'd actually like to query the logs themselves and write them out somewhere else or use them in a job (amongst all the other data kept about jobs in the repository), you can directly query the DSODB schema in the XMETA database, i.e. the DataStage repository (DB2 by default).
You would need to have the DataStage Operations Console up and running for that (not sure what version of DataStage you're running). If DataStage is running on a single tier and using the default DB2 database, you can simply catalog the DSODB database so that it's available as a connection in the DB2 connector. Otherwise you'd need to install a DB2 client on the DataStage engine tier and catalog the database there.
All the best!

Related

AWS DMS Error when trying to replicate Oracle to PostgreSQL

I'm trying to replicate several schemas in an Oracle database to a PostgreSQL database.
When the DMS task is started with the "Full load, ongoing replication" type, the task fails after some time while the tables are in the Before Load status. This is the error I get when the task fails:
Last Error Task error notification received from subtask 0, thread 0 [reptask/replicationtask.c:2673] [1022301]
Oracle CDC stopped; Error executing source loop; Stream component failed at subtask 0,
component st_0_LBI2ND3ZI65BF6DQQYK4ITPYAY ; Stream component 'st_0_LBI2ND3ZI65BF6DQQYK4ITPYAY'
terminated [reptask/replicationtask.c:2680] [1022301] Stop Reason FATAL_ERROR Error Level FATAL
However, when the same tables are added to a task with the Full Load type, it works without any issue. The error occurs only when trying to run the task for replicating ongoing changes.
I tried searching for this error but couldn't find an exact reason. I have configured the endpoints properly, and both source and target endpoints have the required permissions for replicating changes. How can I get this resolved?
For the replication to work properly, you need to enable SUPPLEMENTAL LOGGING on all the required tables in your source DB.
This can be due to multiple reasons, although the basic cause remains the same: DMS is not able to read the logs in your Oracle database and it times out.
Before proceeding, I assume you have followed all the steps in the AWS documentation for CDC setup.
As mentioned in the answer above, supplemental logging should be enabled at the database level, as well as for all columns and primary keys at the table level. For example:
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;
ALTER TABLE schema_name.table_name ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
ALTER TABLE PCUSER.PC_POLICY ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;
The log retention period should be long enough for CDC to read the logs before they are deleted. The AWS docs have a troubleshooting section for this issue.
The DMS user that you are using should have read/write/alter access to all the schemas you are trying to read from. In my case it happened several times that, after adding new tables to the schema, I got this error again because the user I was using did not have access to read the newly added tables.
It also depends on what you are using to mine the logs. If it is LogMiner, the setup is quite simple; for Binary Reader there are a few extra commands you need to execute, which are mentioned in the setup documentation.
Log in to the database as the same user you are using in DMS and check whether the archived redo logs are visible:
SELECT * FROM V$ARCHIVED_LOG;
Also check the DEST_ID column in the output of that query. As far as I have read, the default value in DMS is 0. You can check the value for your database and set it in the extra connection attributes:
archivedLogDestId=1;
Check whether there are multiple DEST_IDs for your logs. For example, if you see a DEST_ID of 1, confirm that it is the only one using:
SELECT * FROM V$ARCHIVED_LOG WHERE dest_id NOT IN (1)
This should return nothing. If it does return records, copy those extra DEST_IDs and put them in the connection attribute below:
additionalArchivedLogDestId=[0,2,3, ...,n]
Finally, if this doesn't work, enable detailed debug logging on the task. In our case LogMiner, and thus the DMS user, did not have access to read the redo logs.
A few extra connection attributes that I used with LogMiner, which may help you:
addSupplementalLogging=Y;useLogminerReader=Y;archivedLogDestId=1;additionalArchivedLogDestId=[0,2,3];ignoreTxnCtxValidityCheck=false;cdcTimeout=1200
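The attribute string above is easy to mistype by hand. As a minimal sketch (the helper name `build_extra_connection_attributes` is hypothetical and not part of any AWS SDK), the same string can be assembled from a dictionary so that each setting is reviewed separately. The attribute names and values are copied from the answer, not validated against the DMS documentation.

```python
def build_extra_connection_attributes(attrs):
    """Join attribute names and values into one 'name=value;name=value' string.

    Hypothetical helper for illustration; DMS itself only sees the final string.
    """
    parts = []
    for name, value in attrs.items():
        if isinstance(value, bool):
            # The answer writes booleans in lowercase (e.g. 'false').
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # The answer writes lists as [0,2,3] with no spaces.
            value = "[" + ",".join(str(v) for v in value) + "]"
        parts.append(f"{name}={value}")
    return ";".join(parts)


# Values taken verbatim from the answer above.
attrs = {
    "addSupplementalLogging": "Y",
    "useLogminerReader": "Y",
    "archivedLogDestId": 1,
    "additionalArchivedLogDestId": [0, 2, 3],
    "ignoreTxnCtxValidityCheck": False,
    "cdcTimeout": 1200,
}

eca = build_extra_connection_attributes(attrs)
print(eca)
```

The resulting string is what would be pasted into the source endpoint's extra connection attributes field.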

IMPDP uses more disk space than expected

Background:
I've been tasked with importing a large amount of data from a production database to a test database (Oracle 12c release 2 running on RHEL) and I'm using Data Pump.
The first time I imported the tables, they were created and the data was imported as planned, but, due to an issue in my Data Pump parameter file, the constraints were not imported.
My subsequent attempts did not go as well, however. Data Pump began to freeze part way and the STATUS command showed that no bytes were being processed.
My Solution Attempts:
I tried using TABLE_EXISTS_ACTION=REPLACE and dropping the tables directly after an attempt. I also dropped the master tables of any data pump jobs I was unable to terminate from the utility.
Still, it seemed to hang earlier and earlier in the process as I kept trying to import the tables. df -h returned 100% disk usage every time it hung.
The dump file itself is on a separate drive, so it's no longer taking up room. I've been trying to clear out space, but it keeps filling up when I run a job and I can't tell where. Oracle flashback is disabled and I made sure to purge the Oracle recycle bin.
tl;dr:
Running impdp jobs seems to use up disk space beyond the added tables and the job master tables. Where is this disk space being used, and what can I do to clear it for a successful import?
I figured out the problem:
The database was in archivelog mode in order to set up Streams and Recovery Manager (RMAN) backups. As a result, impdp was causing a flood of archived redo logs.
In order to clear out the old archives, I ran the following in RMAN for every database in noarchivelog mode on the server.
connect target /
run {
allocate channel c1 type disk;
delete force noprompt archivelog until time 'SYSDATE-30';
release channel c1;
}
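This cleared up 60 gigabytes. I also added the parameter transform=disable_archive_logging:Y to my impdp parameter file. This disables redo logging for the imported tables and indexes while the import runs, which greatly reduces archive generation during Data Pump imports. Note that it has no effect if the database is running in FORCE LOGGING mode.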
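A minimal sketch of how the two precautions above could be scripted before launching impdp: checking free disk space and rendering a parameter file that includes the transform setting. The function names, the 20% threshold, the directory object name and the dump file name are assumptions for illustration only. The TABLE_EXISTS_ACTION and transform values come from the question and answer.

```python
import shutil


def free_fraction(path):
    """Return the fraction of free space (0.0 to 1.0) on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


def render_parfile(params):
    """Render impdp parameters as 'name=value' lines, one per line."""
    return "\n".join(f"{name}={value}" for name, value in params.items()) + "\n"


params = {
    "directory": "DATA_PUMP_DIR",      # assumed directory object name
    "dumpfile": "prod_export.dmp",     # hypothetical dump file name
    "table_exists_action": "REPLACE",  # from the question
    "transform": "disable_archive_logging:Y",  # from the answer
}

parfile_text = render_parfile(params)

# Assumed threshold: refuse to start when less than 20% of the disk is free.
# In practice the path checked would be the archive log destination.
if free_fraction(".") < 0.20:
    print("Low disk space: clear old archived logs before importing.")
else:
    print(parfile_text)
```

The rendered text would be saved to a file and passed to impdp with its parfile option.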

Configured Oracle GoldenGate DML replication between two Oracle databases

I have successfully configured Oracle DML replication using Oracle GoldenGate. Is there any query to check whether the source and target are in sync, or some other way to verify it?
No replication tool has the functionality to check whether the database is in sync. The idea of asynchronous replication is that it is never fully in sync: the target always lags behind the source database. Only fully synchronous disk replication gives you a fully in-sync copy of the data.
You might want to check whether the "not recently changed" data is the same, using a compare-every-row technology. Oracle has a product called Veridata which can do the job.
You might also want to check that the replication stream is working (that it has not stopped). But this does not check whether the data is in sync: someone might modify the target data, and this check would not detect that. The heartbeat technology only checks that the replication stream is not broken. OGG 12.2 has special built-in commands for that.
Please check:
ADD HEARTBEATTABLE command for ggsci
ENABLE_HEARTBEAT_TABLE parameter for processes
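To illustrate the compare-every-row idea mentioned above, here is a minimal sketch that hashes each row and compares source and target by primary key. This is not how Veridata works internally. It is only an illustration, and the function names are hypothetical. It assumes the rows have already been fetched from both databases (for example through DB-API cursors) as tuples, with the key in the first column.

```python
import hashlib


def row_digest(row):
    """Hash all column values of a row into one hex digest.

    Note: None and the empty string are treated as equal here, which
    matches Oracle's handling of empty VARCHAR2 values.
    """
    joined = "\x1f".join("" if value is None else str(value) for value in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def compare_tables(source_rows, target_rows, key_index=0):
    """Return keys that are missing, extra, or different on the target."""
    src = {row[key_index]: row_digest(row) for row in source_rows}
    tgt = {row[key_index]: row_digest(row) for row in target_rows}
    return {
        "missing_in_target": sorted(k for k in src if k not in tgt),
        "extra_in_target": sorted(k for k in tgt if k not in src),
        "changed": sorted(k for k in src if k in tgt and src[k] != tgt[k]),
    }


# Made-up sample data standing in for query results.
source = [(1, "a"), (2, "b"), (3, "c")]
target = [(1, "a"), (2, "B"), (4, "d")]

result = compare_tables(source, target)
print(result)
```

Because the target lags behind the source, such a comparison is only meaningful for rows that have not changed recently, as the answer points out.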

Manipulating Data Within AWS Redshift to a Schedule

Current Setup:
SQL Server OLTP database
AWS Redshift OLAP database updated from OLTP via SSIS every 20 minutes
Our customers only have access to the OLAP Db
Requirement:
One customer requires some additional tables to be created and populated to a schedule which can be done by aggregating the data already in AWS Redshift.
Challenge:
This is only for one customer, so I cannot leverage the core process for populating AWS; the process must be independent and is to be handed over to the customer, who does not use SSIS and doesn't wish to start. I was considering using Data Pipeline, but this is not yet available in the market in which the customer resides.
Question:
What is my alternative? I am aware of numerous partners who offer ETL-like solutions, but this seems over the top. Ultimately, all I want to do is execute a series of SQL statements on a schedule with some form of error handling/alerting. The preference of both customer and management is not to use a bespoke app to do this, hence the intended use of Data Pipeline.
For exporting data from AWS Redshift to another data source using Data Pipeline, you can follow a template similar to https://github.com/awslabs/data-pipeline-samples/tree/master/samples/RedshiftToRDS, which transfers data from Redshift to RDS. Instead of using RDSDatabase as the sink, you could add a JdbcDatabase (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-jdbcdatabase.html). The template https://github.com/awslabs/data-pipeline-samples/blob/master/samples/oracle-backup/definition.json provides more details on how to use the JdbcDatabase.
There are many such templates available at https://github.com/awslabs/data-pipeline-samples/tree/master/samples to use as a reference.
I do exactly the same thing as you, but I use the AWS Lambda service to perform my ETL. One drawback of Lambda is that a function can run for a maximum of only 5 minutes (initially 1 minute).
So for ETL jobs that take longer than 5 minutes, I am planning to set up a PHP server in AWS and run my SQL queries from PHP scripts, scheduled at any time with the help of cron.
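Whichever scheduler ends up being used (Lambda, cron on a small server, or something else), the core requirement in the question is a series of SQL statements run in order with error handling. A minimal sketch of that part follows. It uses Python's built-in sqlite3 module purely as a stand-in so that the example runs anywhere. For Redshift one would presumably use a PostgreSQL-compatible DB-API driver instead, which is an assumption and not shown here. The table names and the `run_statements` helper are hypothetical.

```python
import logging
import sqlite3

logger = logging.getLogger("scheduled_sql")


def run_statements(conn, statements):
    """Run SQL statements in order; commit all of them or roll back on any error."""
    cur = conn.cursor()
    try:
        for sql in statements:
            cur.execute(sql)
        conn.commit()
        return True
    except Exception:
        conn.rollback()
        # An alert (email, SNS, and so on) could be sent from here.
        logger.exception("Scheduled SQL batch failed")
        return False
    finally:
        cur.close()


# Stand-in database with made-up tables and data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.execute("CREATE TABLE sales_summary (customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("acme", 10.0), ("acme", 5.0), ("zen", 7.0)],
)
conn.commit()

# The aggregation batch: rebuild the summary table from the detail table.
ok = run_statements(conn, [
    "DELETE FROM sales_summary",
    "INSERT INTO sales_summary "
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer",
])
totals = dict(conn.execute("SELECT customer, total FROM sales_summary"))

# A failing batch leaves the previous summary untouched thanks to the rollback.
failed = run_statements(conn, [
    "DELETE FROM sales_summary",
    "INSERT INTO no_such_table VALUES (1)",
])
remaining = conn.execute("SELECT COUNT(*) FROM sales_summary").fetchone()[0]
```

Running the whole batch in one transaction means the customer never sees a half-rebuilt table when a statement fails.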

Oracle to Neo4j Sync

Is there any utility to sync data between Oracle and Neo4j databases? I want to use Neo4j in read-only mode, and all writes will happen in the Oracle DB.
I think this depends on how often you want to have the data synced. Are you looking for a periodic sync/ETL process (say hourly or daily), or are you looking for live updates into Neo4j?
I'm not aware of tools designed for this, but it's not terribly difficult to script yourself.
A periodic sync is obviously easiest. You can do that directly using the Java API and connecting via JDBC to Oracle. You could also just dump the data from Oracle as a CSV and import it into Neo4j. This would be done similarly to how data is imported from PostgreSQL in this article: http://neo4j.com/developer/guide-importing-data-and-etl/
There is a SO response for exporting data from Oracle using sqlplus/spool:
How do I spool to a CSV formatted file using SQLPLUS?
If you're looking for live syncing, you'd probably do this either through monitoring the transaction log or by adding triggers onto your tables, depending on the complexity of your data.
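As a minimal sketch of the periodic CSV route described above: the snippet below writes rows to CSV with a header line, the shape a CSV import into Neo4j would typically consume. The rows are hard-coded sample data. In practice they would come from a query against Oracle (for example through a DB-API cursor, which is an assumption and not shown). The column names and the `rows_to_csv` helper are hypothetical.

```python
import csv
import io


def rows_to_csv(header, rows, out):
    """Write a header line and the data rows to a file-like object as CSV."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        # Represent database NULLs as empty fields.
        writer.writerow(["" if value is None else value for value in row])


# Made-up sample rows standing in for an Oracle query result.
header = ["id", "name", "manager_id"]
rows = [
    (1, "Ann", None),
    (2, "Bob, Jr.", 1),
]

buf = io.StringIO()
rows_to_csv(header, rows, buf)
csv_text = buf.getvalue()
print(csv_text)
```

Using the csv module rather than string concatenation takes care of quoting values that contain commas, which is a common source of broken imports.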
