We are currently trying to use Sqoop to ingest data from Hadoop into Azure SQL Data Warehouse, but we are getting an error related to the transaction isolation level. What's happening is that Sqoop tries to set the transaction isolation level to READ COMMITTED during import/export, and this isn't currently supported in Azure SQL Data Warehouse. I've tried Sqoop's --relaxed-isolation parameter, but it has no effect.
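For reference, the export is invoked roughly along these lines; the connection string, credentials, table name, and HDFS path below are placeholders, not the actual values:

    sqoop export \
      --connect "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydw" \
      --username myuser --password-file /user/me/sqlserver.password \
      --table my_table \
      --export-dir /data/my_table \
      --relaxed-isolation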
As a solution, I am thinking to:
1. Change the Sqoop source code so that it does not set the transaction isolation level.
2. Look for APIs (if any) that would let me change this Sqoop behavior programmatically.
Has anyone encountered this scenario? I'm looking for suggestions on the proposed solutions and how to go about them.
This issue has just been resolved in Sqoop: https://issues.apache.org/jira/browse/SQOOP-2349
Otherwise, @wBob's comment about using PolyBase is definitely best practice: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-azure-sql-data-warehouse-connector#use-polybase-to-load-data-into-azure-sql-data-warehouse
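With the PolyBase approach the load is pulled from inside the warehouse rather than pushed by Sqoop. A minimal sketch of that pattern follows; the storage account, container, file layout, column list, and object names are all placeholders/assumptions, not values from the question:

    -- Sketch only: storage account, paths, columns and object names are placeholders
    CREATE MASTER KEY;  -- once per database, if not already created

    CREATE DATABASE SCOPED CREDENTIAL BlobCred
    WITH IDENTITY = 'user', SECRET = '<storage-account-key>';

    CREATE EXTERNAL DATA SOURCE MyBlobStore
    WITH (TYPE = HADOOP,
          LOCATION = 'wasbs://mycontainer@mystorageaccount.blob.core.windows.net',
          CREDENTIAL = BlobCred);

    CREATE EXTERNAL FILE FORMAT CsvFormat
    WITH (FORMAT_TYPE = DELIMITEDTEXT,
          FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"'));

    CREATE EXTERNAL TABLE dbo.MyTable_ext (id INT, name NVARCHAR(100))
    WITH (LOCATION = '/data/my_table/',
          DATA_SOURCE = MyBlobStore,
          FILE_FORMAT = CsvFormat);

    -- Load into an internal table with CTAS
    CREATE TABLE dbo.MyTable
    WITH (DISTRIBUTION = HASH(id))
    AS SELECT * FROM dbo.MyTable_ext;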
Related
We want to copy tables from Oracle to Cassandra every day, because the tables are updated in Oracle daily. When I searched for this, I found these options:
Extract the Oracle tables to files, then write them to Cassandra
Use Sqoop to get the tables from Oracle, then write a MapReduce job to insert them into Cassandra
I am not sure which way is appropriate. Are there other options?
Thank you.
Option 1
Extracting Oracle tables to files and then writing them to Cassandra manually every day can be a tiresome process unless you schedule it as a cron job. I have tried this before, but if the process fails, logging it can be an issue. If you use this approach, export to CSV, and then need to write to Cassandra, I would suggest using the Cassandra bulk loader (https://github.com/brianmhess/cassandra-loader), for example as shown below.
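A typical cassandra-loader invocation looks roughly like this; the host, keyspace, table, and column list are placeholders, so check the project README for the exact options in your version:

    cassandra-loader -f /exports/orders.csv \
      -host 10.0.0.5 \
      -schema "mykeyspace.orders(order_id, customer_id, amount, updated_at)"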
Option 2
I haven't worked with this, so I can't speak to it.
Option 3 (I use this)
I use an open source tool, Pentaho Data Integration (Spoon) (https://community.hitachivantara.com/docs/DOC-1009855-data-integration-kettle), to solve this problem. It's a fairly simple process to set up in Spoon. You can automate the process by using a Carte server (Spoon's server), which has logging capabilities as well as automatic restarting if the process fails partway through; see the command-line sketch below.
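For scheduling, a transformation built in Spoon can be run headlessly with the command-line tools that ship with PDI; the file path, host, and port below are placeholders:

    # Run a transformation built in Spoon from the command line (e.g. from cron)
    ./pan.sh -file=/etl/oracle_to_cassandra.ktr -level=Basic

    # Or start a Carte server and execute/monitor transformations remotely
    ./carte.sh 127.0.0.1 8081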
Let me know if you found any other solution that worked for you.
I have a parallel job that writes to an Oracle table. I want to manually write warnings to DataStage's log if some event occurs. For example, if a certain value is inserted into a certain column, I want to track this information in the log. Can this be achieved somehow?
To write custom messages into the log for a particular job's data stream, you can use a combination of a Copy stage, a Transformer, and a Peek stage. The Peek stage is the one that writes to the log. I like to set the Peek stage to run in sequential mode so that your messages are kept together in single log entries instead of being spread across nodes.
You can also peek the rejects of the Oracle stage, and perhaps combine this with the option above (using a Funnel stage and a standard column schema).
Lastly, if you'd actually like to query the logs themselves and write them out somewhere else, or use them in a job (among all the other data kept about jobs in the repository), you can directly query the DSODB schema in the XMETA database, i.e. the DataStage repository (DB2 by default).
You would need the DataStage Operations Console up and running for that (I'm not sure which version of DataStage you're running). If DataStage is running on a single tier and using the default DB2 database, you can simply catalog the DSODB database so that it's available as a connection in the DB2 Connector. Otherwise, you'd need to install a DB2 client on the DataStage engine tier and catalog the database there.
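As a rough illustration, such a query might look like the sketch below; the DSODB tables are documented alongside the Operations Console, so treat the exact table and column names here as assumptions to verify against your version:

    -- Sketch only: verify table/column names against your DSODB schema version
    SELECT r.RUNID, l.LOGTYPE, l.MESSAGETEXT
    FROM DSODB.JOBRUNLOG l
    JOIN DSODB.JOBRUN r ON r.RUNID = l.RUNID
    WHERE l.LOGTYPE IN ('WARN', 'FATAL')
    ORDER BY r.RUNID;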
All the best!
I have successfully configured Oracle DML replication using Oracle GoldenGate, but is there any query to check whether the source and target are in sync, or some other way to verify it?
No replication tool has the functionality to check whether the databases are in sync. The idea of asynchronous replication is that it is never fully in sync: the target always lags behind the source database. Only fully synchronous disk replication gives a fully in-sync copy of the data.
You might want to check whether the "not recently changed" data is the same using a compare-every-row technology. Oracle has a product called Veridata which can do the job.
You might also want to check that the replication stream is working (i.e. not stopped). But this does not verify that the data is in sync: someone might modify the target data and you would not be able to detect that. The heartbeat technology just checks that the replication stream is not broken. OGG 12.2 has special built-in commands for that.
Please check:
the ADD HEARTBEATTABLE command in GGSCI
the ENABLE_HEARTBEAT_TABLE parameter for the processes
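In GGSCI that looks roughly like this (the credential alias is a placeholder, and the exact options vary by OGG version):

    GGSCI> DBLOGIN USERIDALIAS ggadmin   -- alias name is a placeholder
    GGSCI> ADD HEARTBEATTABLE
    GGSCI> INFO HEARTBEATTABLE

The recorded lag should then be queryable from the lag views that ADD HEARTBEATTABLE creates in the GoldenGate admin schema.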
After restarting the Impala server, we are not able to see the tables (i.e. the tables are not coming up). Can anyone tell me what steps we have to follow to avoid this issue?
Thanks,
Srinivas
You should try running "invalidate metadata;" from impala-shell. This usually clears up tables not being visible, since Impala caches metadata.
From:
https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_invalidate_metadata.html
The following example shows how you might use the INVALIDATE METADATA statement after creating new tables (such as SequenceFile or HBase tables) through the Hive shell. Before the INVALIDATE METADATA statement was issued, Impala would give a "table not found" error if you tried to refer to those table names.
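Concretely, that amounts to something along these lines from impala-shell (the database and table names are placeholders):

    -- Refresh Impala's cached metadata for the whole catalog
    INVALIDATE METADATA;

    -- Or for just the table that was created through Hive
    INVALIDATE METADATA mydb.new_table;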
(To make things clear: I do not write code myself.)
I am looking for a solution that would allow me to replicate data between a master Oracle 11g DB and a new PostgreSQL DB. They belong to two different applications, but they need to exchange data in real time. There are some trigger-based approaches, but there is a big concern that they could affect the master DB's performance, which we can't accept.
I have also come across some log-based solutions, like HVR, but the cost is way too high for 500MB of data to be replicated.
Has anyone had a similar issue and found a way to deal with it?
Any tips or help would be really appreciated, as I am quite short on time.
Oracle archive logs have a different format than Postgres write-ahead logs. Despite the general conceptual similarity of Oracle Streams, SQL Server log shipping, Postgres streaming replication, etc., transaction logs <> redo logs <> xlogs, and you can't apply one vendor's logs to another vendor's engine.
Moreover, you can't even apply logs across different versions of the same DB vendor because of differences in the binary format.
You can get something like logical replication with Postgres logical decoding, Oracle GoldenGate, heterogeneous database replication, or AWS DMS, but none of the above gives you "log-based replication" between different DB vendors.
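To give a flavour of what the PostgreSQL side of that looks like, logical decoding exposes the change stream through replication slots. The slot name below is a placeholder, and note that this captures changes made in Postgres, not in Oracle:

    -- Requires wal_level = logical (and max_replication_slots > 0) in postgresql.conf
    SELECT * FROM pg_create_logical_replication_slot('my_slot', 'test_decoding');

    -- Peek at decoded changes without consuming them
    SELECT * FROM pg_logical_slot_peek_changes('my_slot', NULL, NULL);

    -- Consume (and advance past) the decoded changes
    SELECT * FROM pg_logical_slot_get_changes('my_slot', NULL, NULL);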
You can use a product that specializes in change-data-capture-based data integration. Striim, GoldenGate, and Attunity allow you to do CDC from Oracle. Striim also allows you to do CDC from PostgreSQL and write to Oracle as well.
https://striim.com
https://attunity.com