Oracle connection with Spark SQL

I'm trying to connect to an Oracle DB from Spark SQL with the following code:
import org.apache.spark.storage.StorageLevel

val dataTarget = sqlContext.read.
  format("jdbc").
  option("driver", config.getString("oracledriver")).
  option("url", config.getString("jdbcUrl")).
  option("user", config.getString("usernameDH")).
  option("password", config.getString("passwordDH")).
  option("dbtable", targetQuery).
  option("partitionColumn", "ID").
  option("lowerBound", "5").
  option("upperBound", "499999").
  option("numPartitions", "10").
  load().persist(StorageLevel.DISK_ONLY)
By default, when we connect to Oracle through Spark SQL, a single connection is created and the entire table ends up in one partition of the RDD. This way I lose parallelism and run into performance issues when there is a huge amount of data in a table. In my code I have passed option("numPartitions", "10"),
which will create 10 connections. Please correct me if I'm wrong, but as I understand it, the number of connections to Oracle will equal the number of partitions we pass.
I'm getting the error below if I use more connections, possibly because there is a connection limit on the Oracle side.
java.sql.SQLException: ORA-02391: exceeded simultaneous
SESSIONS_PER_USER limit
If I use more partitions for more parallelism, the error occurs, but if I use fewer I face performance issues. Is there any other way to create a single connection and load the data into multiple partitions (this would save my life)?
Please suggest.

Is there any other way to create a single connection and load data into multiple partitions
There is not. In general, partitions are processed by different physical nodes and different virtual machines. Considering all the authorization and authentication mechanisms, you cannot just take a connection and pass it from node to node.
If the problem is just exceeding SESSIONS_PER_USER, contact the DBA and ask to increase that value for the Spark user.
If the problem is throttling, you can try to keep the same number of partitions but decrease the number of Spark cores. Since this is mostly micromanagement, it might be better to drop JDBC completely, use a standard export mechanism (COPY FROM) and read the files directly.
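For the last option, once the table has been dumped to flat files by whatever export tool your DBA prefers, Spark can read those files directly and parallelise the read itself without holding any Oracle sessions. A minimal sketch, assuming the export produced CSV files with a header row (the path and schema options below are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-exported-files").getOrCreate()

// Spark splits the files into partitions on its own, so no database sessions are held open.
val dataTarget = spark.read.
  option("header", "true").
  option("inferSchema", "true").
  csv("/data/exports/target_table/*.csv")   // placeholder path to the exported files

dataTarget.rdd.getNumPartitions   // typically one partition per file split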

One workaround might be to load the data using a single Oracle connection (one partition) and then simply repartition:
val dataTargetPartitioned = dataTarget.repartition(100)
You can also repartition by a column (when working with a DataFrame):
import org.apache.spark.sql.functions.col

val dataTargetPartitioned = dataTarget.repartition(100, col("MY_COL"))

Related

How to increase speed of bulk insert from Oracle to IBM DB2 via Apache NiFi

I am curious about ways to tune bulk inserts via Apache NiFi for speed, and whether a different driver or other configurations could speed up the process. Any inputs or references to resources would be greatly appreciated!
This is my current flow, with the configurations included in the pictures. The source DB is Oracle and the destination DB is IBM DB2 for z/OS:
I think you have a few things working against you:
You probably have low concurrency set on the PutDatabaseRecord processor.
You have a very large fetch size.
You have a very large record-per-flowfile count.
From what I've read in the past, the fetch size controls how many records will be pulled from the query's remote result in each iteration. So in your case, it has to pull 100k records before it will even register data being ready. Try dropping it down to 1k records for the fetch and experiment with 100-1000 records per flowfile.
If you're bulk inserting that flowfile, you're also sending over 100k inserts at once.
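To build intuition for why those settings matter, here is a hedged plain-JDBC sketch in Scala (this is not what NiFi does internally, and the URLs, credentials, tables and sizes are placeholders): the fetch size controls how many rows each round trip to Oracle returns, and the batch size controls how many inserts are shipped to DB2 per round trip.

import java.sql.DriverManager

val src = DriverManager.getConnection("jdbc:oracle:thin:@//srchost:1521/SRCDB", "user", "pw")
val dst = DriverManager.getConnection("jdbc:db2://db2host:446/LOCDB", "user", "pw")
dst.setAutoCommit(false)

val stmt = src.createStatement()
stmt.setFetchSize(1000)                       // rows pulled from Oracle per round trip
val rs = stmt.executeQuery("select id, payload from source_table")

val ins = dst.prepareStatement("insert into target_table (id, payload) values (?, ?)")
var inBatch = 0
while (rs.next()) {
  ins.setLong(1, rs.getLong(1))
  ins.setString(2, rs.getString(2))
  ins.addBatch()
  inBatch += 1
  if (inBatch == 500) { ins.executeBatch(); dst.commit(); inBatch = 0 }   // small, frequent batches
}
if (inBatch > 0) { ins.executeBatch(); dst.commit() }
src.close(); dst.close()

A huge fetch size means nothing is emitted until the whole chunk has crossed the network, and a huge batch means one giant insert round trip and a long-running transaction on the DB2 side.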

How much data/networking usage does an Oracle Client use while running queries across schemas?

When I run a query to copy data from schemas, does it perform all SQL on the server end or copy data to a local application and then push it back out to the DB?
The two tables sit in the same DB, but the DB is accessed through a VPN. Would it change if it was across databases?
For instance (Running in Toad Data Point):
create table schema2.my_table
as
select
  sum(row1) as row1_sum
  ,row2
from schema1.my_table
group by row2
The reason I ask is that I'm getting quotes for a virtual machine in Azure Cloud and want to make sure I'm not going to break the bank on data costs.
The processing of SQL statements on the same database usually takes place entirely on the server and generates little network traffic.
In Oracle, schemas are a logical object. There is no physical barrier between them. In a SQL query using two tables it makes no difference if those tables are in the same schema or in different schemas (other than privilege issues).
Some exceptions:
Real Application Clusters (RAC) - RAC may share a huge amount of data between the nodes. For example, if the table was cached on one node and the processing happened on another, it could send all the table data through the network. (I'm not sure how this works on the cloud though. Normally the inter-node traffic is done with a separate, dedicated network connection.)
Database links - It should be obvious if your application is using database links though.
Oracle Reports and Forms(?) - A few rare tools have client-side PL/SQL processing. Possibly those programs might send data to the client for processing. But I still doubt it would do something crazy like send an entire table to the client to be sorted, and then return the results to the server.
Backups/archive logs - I assume all the data will be backed up. I'm not sure how that's counted, but possibly that means all data written will also be counted as network traffic eventually.
The queries below are examples of different ways to check the network traffic being generated.
--SQL*Net bytes sent for a session.
select *
from gv$sesstat
join v$statname
on gv$sesstat.statistic# = v$statname.statistic#
--You probably also want to filter for a specific INST_ID and SID here.
where lower(display_name) like '%sql*net%';
--SQL*Net bytes sent for the entire system.
select *
from gv$sysstat
where lower(name) like '%sql*net%'
order by value desc;

Oracle db table data loading is too slow in DBeaver

I'm using DBeaver to connect to an Oracle database. The database connection and the table properties view work fine without any delay, but fetching table data is too slow (sometimes around 50 seconds).
Any settings to speed up fetching table data in DBeaver?
Changing the following setting in your Oracle DB connection will make fetching table data faster than when it is not set:
Right click on your db connection --> Edit Connection --> Oracle properties --> tick on 'Use RULE hint for system catalog queries'
(by default this is not set)
UPDATE
In the newer version (21.0.0) of DBeaver, many more performance options appear here. Turning them on significantly improves performance for me.
I've never used DBeaver, but I often see applications that use too small an "array fetch size"**, which often causes slow fetches.
** Array fetch size note:
As per the Oracle documentation, the fetch buffer size is an application-side memory setting that affects the number of rows returned by a single fetch. Generally you balance the number of rows returned by a single fetch (a.k.a. the array fetch size) against the total number of rows that need to be fetched.
A low array fetch size relative to the number of rows that need to be returned will manifest as delays from the increased network and client-side processing needed for each fetch (i.e. the high cost of each network round trip [SQL*Net protocol]).
If this is the case, you will likely see very high waits on “SQL*Net message from client” [in gv$session or elsewhere].
SQL*Net message from client
This wait event is posted by the session when it is waiting for a message from the client to arrive. Generally, this means that the session is just sitting idle; however, in a client/server environment it could also mean that either the client process is running slowly or there are network latency delays. Database performance is not degraded by high wait times for this wait event.
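If you want to see the round-trip effect outside of DBeaver, a quick experiment is to run the same query over plain JDBC with two different fetch sizes and compare the elapsed time. A minimal sketch in Scala; the URL, credentials and table name are placeholders, and 10 is the Oracle JDBC driver's default fetch size:

import java.sql.DriverManager

def timedFetch(fetchSize: Int): Unit = {
  val conn = DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1", "user", "pw")
  try {
    val stmt = conn.createStatement()
    stmt.setFetchSize(fetchSize)              // rows returned per SQL*Net round trip
    val start = System.nanoTime()
    val rs = stmt.executeQuery("select * from some_big_table")
    var rows = 0
    while (rs.next()) rows += 1               // drain the result set
    val ms = (System.nanoTime() - start) / 1000000
    println(s"fetchSize=$fetchSize rows=$rows elapsed=${ms}ms")
  } finally conn.close()
}

timedFetch(10)      // default: many round trips
timedFetch(1000)    // far fewer round trips, usually much faster over a slow link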

How to replicate, for DR purposes, a Greenplum DB to another data centre?

We are planning for a large Greenplum DB (growing from 10 to 100TB over the first 18 months). Traditional backup and restore tools aren't going to help as we have 24hr RPO/RTOs to deal with.
Is there a way to replicate the DB across to our DR site without resorting to block replication (i.e. place a segment on SAN and mirror)?
You've got a number of options to choose from:
Dual ETL. Replicate the input data and run the same ETL on both sites. Synchronize them with backup-restore every week or so.
Backup-restore. A simple backup-restore might not be that efficient, but if you use Data Domain it can perform deduplication at the block level and store only the changed blocks. It can offload the deduplication task to run on the Greenplum cluster (DDBoost). Also, when replicating to a remote site, it replicates only the changed blocks, which greatly reduces replication time. In my experience, if a clean backup to Data Domain takes 12 hours, a subsequent DDBoost backup will take 4 hours + 4 hours to replicate the data.
Custom solution. I know of a case where the data replication to the remote site is done as part of the ETL process. The ETL job knows which tables have changed, so they are added to a replication queue and moved to the remote site using external tables. Analysts work in a special sandbox, and their sandbox is replicated with backup-restore daily.
At the moment Greenplum does not have a built-in WAN replication solution, so these are just about all the options to choose from.
I did some investigation on this. Here are my results:
I. Using EMC Symmetrix VMAX SAN (Storage Area Network) mirroring and SRDF (Symmetrix Remote Data Facility) remote replication software
Please refer to h12079-vnx-replication-technologies-overview-wp.pdf for details.
Preconditions
1. Having an EMC Symmetrix VMAX SAN installed
2. Having the SRDF software
Advantages of the 3 different modes
1. Symmetrix Remote Data Facility / Synchronous (SRDF/S)
Provides a no-data-loss solution (zero RPO). No server resource contention for the remote mirroring operation. Can perform restoration of the primary site with minimal impact to application performance on the remote site. Enterprise disaster recovery solution. Supports replicating over IP and Fibre Channel protocols.
2. Symmetrix Remote Data Facility / Asynchronous (SRDF/A)
Extended-distance data replication that supports longer distances than SRDF/S. SRDF/A does not affect host performance, because host activity is decoupled from the remote copy process. Efficient link utilization that results in lower link-bandwidth requirements. Facilities to invoke failover and restore operations. Supports replicating over IP and Fibre Channel protocols.
3. Symmetrix Remote Data Facility / Data Mobility (SRDF/DM)
II. Using Backup Tools
Please refer to http://gpdb.docs.pivotal.io/4350/admin_guide/managing/backup.html for details
Parallel Backup
Parallel backup utility gpcrondump
Non-parallel backup
It is not recommended; it is used to migrate PostgreSQL databases to Greenplum databases.
Parallel Restore
Supports target systems with either the same configuration as, or a different configuration from, the source Greenplum database.
Non-Parallel Restore
pg_restore requires modifying the CREATE statements to add the DISTRIBUTED BY clause.
 
Disadvantages
1. The backup process locks tables: it puts an EXCLUSIVE lock on the pg_class table, which means only read access is allowed during that period.
2. After releasing the EXCLUSIVE lock on pg_class, it puts an ACCESS SHARE lock on all the tables, which again only allows read access during the lock period.
III. Replay DDL statements
In PostgreSQL, there is a parameter to log all SQL statements to a file.
In data/postgresql.conf, set log_statement to 'all'.
Write an application that collects the DML and DDL statements from the log and runs them on the DR servers; a rough sketch is shown below.
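A very rough sketch in Scala of what such a replay application could look like, assuming the plain-text log format where each logged statement appears on a line containing "LOG:  statement: ..."; the log path, JDBC URL and credentials are placeholders, and real code would have to handle multi-line statements, ordering, de-duplication and errors:

import java.sql.DriverManager
import scala.io.Source

val marker = "LOG:  statement: "
val statements = Source.fromFile("/path/to/pg_log/postgresql.log")    // placeholder log path
  .getLines()
  .filter(_.contains(marker))
  .map(line => line.substring(line.indexOf(marker) + marker.length).trim)
  .filterNot(_.toLowerCase.startsWith("select"))    // replay only statements that change something

val dr = DriverManager.getConnection("jdbc:postgresql://dr-master:5432/mydb", "gpadmin", "pw")
try {
  val st = dr.createStatement()
  statements.foreach(sql => st.execute(sql))
} finally dr.close()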
Advantage
1. Easy to configure and maintain
2. No decrease in the performance
Disadvantage
1. Need additional storage for the statement logging
IV. Parse the wal log of PostgreSQL
Parse the WAL log to extract the DDL statements and then run the generated DDL statements on the DR Greenplum.
Advantage
1. Doesn’t impact the source GreenPlum Database
Disadvantage
1. You have to write code to parse the WAL log.
2. The log is not easy to parse; there is not much documentation about the WAL format.
3. It is unclear whether this is feasible for Greenplum, since it is a solution for PostgreSQL.

How to Move large amount of data between Databases?

I need to compare data from two databases (both of them DB2) which are on different servers with no existing connection between them. Because both of the DBs are used in production, I don't want to overload them; therefore I will create a new DB (probably MySQL) on my local machine, extract data from both DB2 databases, insert it into MySQL, and do the comparison locally.
I would like to do this in Java, so my question is how to do this task as effectively as possible, without overloading the production databases. I've done some research and came up with the following points:
limit the number of columns that I will use in my initial SELECT statement
tune the fetch size of the ResultSet object (the default for IBM DB2 JCC drivers seems to be 64)
use PreparedStatement object to pre-compile the SQL
Is there anything else that I can do, or any other suggestions?
Thank you
DB2 for Linux, UNIX, and Windows includes the EXPORT utility as part of its runtime client. This utility can be pointed at a DB2 database on z/OS to quickly drain a table (or query result set) into a flat file on your client machine. You can choose whether the flat file is delimited, fixed width, or DB2's proprietary IXF format. Your z/OS DBA should be able to help you configure the client on your workstation and bind the necessary packages into the z/OS databases as required by the EXPORT utility.
Once you have the flat files on your client, you can compare them however you like.
Sounds like a great job for MapReduce (Hadoop): one job with two mappers, one for each DB, and a reducer to do the comparison. It can scale to as many processors as you need, or just run on a single machine.
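If you would rather not stand up a Hadoop cluster, the same two-source compare can be expressed in Spark, which also runs happily on a single machine. This is only a sketch, and the URLs, credentials, driver class, table and column names are placeholders:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("db2-compare").master("local[*]").getOrCreate()

def readDb2(url: String, table: String): DataFrame =
  spark.read.format("jdbc").
    option("url", url).
    option("driver", "com.ibm.db2.jcc.DB2Driver").
    option("dbtable", table).
    option("user", "user").
    option("password", "pw").
    option("fetchsize", "1000").              // keep round trips down without hammering the DB
    load().
    select("key_col", "val_col")              // compare only the columns you need

val a = readDb2("jdbc:db2://hostA:50000/DBA", "schema.table_a")
val b = readDb2("jdbc:db2://hostB:50000/DBB", "schema.table_b")

// Rows present on one side but not the other.
a.except(b).show()
b.except(a).show()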
