How to Move large amount of data between Databases? - performance

I need to compare data from two databases (both of them are DB2) which are on different servers with no existing connection between them. Because both of the db's are used in production I don't want to overload them, therefore I will create a new db (probably MySQL) on my local machine, extract data from both of the DB2's, insert into MySQL and do the comparison locally.
I would like to do this in Java, so my question is how to do this task as effectively as possible, without overloading the production databases. I've done some research and came up with the following points:
limit the number of columns that I will use in my initial SELECT statement
tune the fetch size of the ResultSet object (the default for IBM DB2 JCC drivers seems to be 64)
use PreparedStatement object to pre-compile the SQL
Is there anything else that I can do, or any other suggestions?
Thank you

DB2 for Linux UNIX, and Windows includes the EXPORT utility as part of its runtime client. This utility can be pointed at a DB2 database on z/OS to quickly drain a table (or query result set) into a flatfile on your client machine. You can choose whether the flatfile is delimited, fixed width, or DB2's proprietary IXF format. Your z/OS DBA should be able to help you configure the client on your workstation and bind the necessary packages into the z/OS databases as required by the EXPORT utility.
Once you have the flatfiles on your client, you can compare them however you like.

Sounds like a great job for map reduce (hadoop). One job with two mappers, one for each DB and a reducer to do the compares. It can scale to as many processors as you need, or just run on a single machine.

Related

How can we do data analysis for DB replication project

We are facing one issue in our project i.e. Data verification issue.
The project is about Replication of data from Sybase to oracle DBs.
The table structures for Table A across Sybase, Oracle is same.
Same column and primary key combination across all the databases.
e.g. If Sybase has Table A with columns a, b and C
same table with same name and same columns will be available in different databses.
We are done with replication stuff part.But we faced some silent failure like data discrepancy just wondering if there will any tool already available for this.
Any information on his would be helpful. Thanks.
Sybase (now SAP) has a couple products that can be used for data comparisons and reconciliation:
rs_subcmp - an older, 32-bit tool that comes with the Sybase Replication Server product that can be used to compare data between
source and target; SQL reconciliation scripts can be generated from
the differences and then applied to the target to bring it in sync
with the source; if your tables are more than 1GB in size you can
still use rs_subcmp but you'll need to create multiple comparison
jobs (via where clauses) to work on different subsets of your tables
[I don't recall if rs_subcmp can be use for heterogeneous
replication setsup, eg, ASE-Oracle.]
Data Assurance (DA) - the newer, 64-bit product ... also from
Sybase ... which can also compare data and (re)sync the target(s)
from the source (either via SQL reconciliation scripts or directly);
DA is capable of handling comparisons between a handful of
different RDBMS products (eg, ASE-Oracle); I'm currently working on a
project where one of the requirements is to validate (and reconcile
where needed) 200+TB of data being migrated from Oracle to HANA and
I'm using DA for the validation/reconciliation portion of the project
As #TenG has hinted at with his answer, there's a good bit of effort involved to compare data and generate code to reconcile the differences. Rolling your own code is doable but will entail a lot of work. If you've got the money you'll likely find 3rd party tools can get most/all of the work done for you.
If you used a 3rd party product to replicate your data from Sybase to Oracle, you may want to see if the same vendor has a comparison/validation/reconciliation tool you could use.
I've worked on a few migration projects and a key part has always been data reconciliation.
I can only talk about the approaches we took, based on constraints around tools available and minimising downtime, and constraints of available space.
In all cases I took to writing scripts that worked on two levels - summary view and "deep dive". We couldn't find any tools readily available that did what we wanted in a timely enough manner. In fact even the migration tools we found had limitations (datapump, sqlloader, golden gate, etc) and hand coded scripts to handle the bits that we found to be lacking or too slow in the standard tools.
The summary view varied from project to project. It was part functional based (do the accounting figures for transactions match) for the users to verify, and part technical. For smaller tables we could just write simple reports and the diff was straight forward.
For larger tables we wrote technical reports that looked at bands of data (e.g group the PK into 1000s) collect all the column data and produce checksum, generating a report for each table like:
PK ID Range Start Checksum
----------------- -----------
100000 22773377829
200000 38938938282
.
.
Corresponding table pairs from each database were then were "diff"d against each other to highlight discrepancies. Any differences that were found could then be looked at in more detail.
The scripts were written in such a way to allow them to run in parallel looking at discrete bands. Te band ranges were tunable as well to get the best throughput. This obviously sped things up.
The scripts were shell scripts firing off sqlplus reports, and similar for the source database.
On one project there wasn't enough diskspace to do these reports, so I wrote a Java program that queried the two databases side by side, using block queues to fetch and compare rowsets. Being in memory meant this was super fast.
For the "deep dive" we looked at the details for key tables, or for tables that reports a checksum difference.
For the user reports, the users would specify what they wanted to see, and we wrote the reports accordingly.
On the last project, the only discrepancies found were caused by character set conversion issues (people names with accents weren't handled correctly).
On projects where the overall dataset was smaller we extracted the data to XML files and wrote a Java tool to processes pairs and report differences.
The SAP/Sybase rs_subcmp tool is pretty powerful and also pretty hard to use. For details see:
https://help.sap.com/viewer/075940003f1549159206fcc89d020515/16.0.3.3/en-US/feb58db1bd1c1014b134ef4efef25563.html?q=rs_subcmp
You have to pass it key field information, but once you do that, it can retry/restart the compare streams after transient differences. Pretty fancy.
rs_subcmp expects to work on Sybase data source. So to compare against Oracle, you'd probably have to setup one of those Sybase-to-Oracle gateway products ($$$$$).
Could you install the Oracle ODBC drivers and configure them to allow Sybase clients to access Oracle? I'm guessing not (but that's outside the range of my experience).
Note the "-h" option for rs_subcmp. The docs just say it runs a "fast comparison", but what it's actually doing is running queries using the hashbytes() function. Something like:
select keyfield1,keyfield2, hashbytes("Md5",datacol1,datacol2,datacol3)
from mytable
So this sort of query might be good for the "summary view" type comparison discussed above (if the Oracle STANDARD_HASH() function output matches up with the Sybase hashbytes() function (again, outside my experience))
Note, as of ASE 16, there was a bug with the hash() & hashbytes() functions running the Md5 hash option against large varbinary columns where they could use up all procedure cache, potentially crashing the server (CR 811073)

Dynamically List contents of a table in database that continously updates

It's kinda real-world problem and I believe the solution exists but couldn't find one.
So We, have a Database called Transactions that contains tables such as Positions, Securities, Bogies, Accounts, Commodities and so on being updated continuously every second whenever a new transaction happens. For the time being, We have replicated master database Transaction to a new database with name TRN on which we do all the querying and updating stuff.
We want a sort of monitoring system ( like htop process viewer in Linux) for Database that dynamically lists updated rows in tables of the database at any time.
TL;DR Is there any way to get a continuous updating list of rows in any table in the database?
Currently we are working on Sybase & Oracle DBMS on Linux (Ubuntu) platform but we would like to receive generic answers that concern most of the platform as well as DBMS's(including MySQL) and any tools, utilities or scripts that can do so that It can help us in future to easily migrate to other platforms and or DBMS as well.
To list updated rows, you conceptually need either of the two things:
The updating statement's effect on the table.
A previous version of the table to compare with.
How you get them and in what form is completely up to you.
The 1st option allows you to list updates with statement granularity while the 2nd is more suitable for time-based granularity.
Some options from the top of my head:
Write to a temporary table
Add a field with transaction id/timestamp
Make clones of the table regularly
AFAICS, Oracle doesn't have built-in facilities to get the affected rows, only their count.
Not a lot of details in the question so not sure how much of this will be of use ...
'Sybase' is mentioned but nothing is said about which Sybase RDBMS product (ASE? SQLAnywhere? IQ? Advantage?)
by 'replicated master database transaction' I'm assuming this means the primary database is being replicated (as opposed to the database called 'master' in a Sybase ASE instance)
no mention is made of what products/tools are being used to 'replicate' the transactions to the 'new database' named 'TRN'
So, assuming part of your environment includes Sybase(SAP) ASE ...
MDA tables can be used to capture counters of DML operations (eg, insert/update/delete) over a given time period
MDA tables can capture some SQL text, though the volume/quality could be in doubt if a) MDA is not configured properly and/or b) the DML operations are wrapped up in prepared statements, stored procs and triggers
auditing could be enabled to capture some commands but again, volume/quality could be in doubt based on how the DML commands are executed
also keep in mind that there's a performance hit for using MDA tables and/or auditing, with the level of performance degradation based on individual config settings and the volume of DML activity
Assuming you're using the Sybase(SAP) Replication Server product, those replicated transactions sent through repserver likely have all the info you need to know which tables/rows are being affected; so you have a couple options:
route a copy of the transactions to another database where you can capture the transactions in whatever format you need [you'll need to design the database and/or any customized repserver function strings]
consider using the Sybase(SAP) Real Time Data Streaming product (yeah, additional li$ence is required) which is specifically designed for scenarios like yours, ie, pull transactions off the repserver queues and format for use in downstream systems (eg, tibco/mqs, custom apps)
I'm not aware of any 'generic' products that work, out of the box, as per your (limited) requirements. You're likely looking at some different solutions and/or customized code to cover your particular situation.

How much data/networking usage does an Oracle Client use while running queries across schemas?

When I run a query to copy data from schemas, does it perform all SQL on the server end or copy data to a local application and then push it back out to the DB?
The two tables sit in the same DB, but the DB is accessed through a VPN. Would it change if it was across databases?
For instance (Running in Toad Data Point):
create table schema2.table
as
select
sum(row1)
,row2
from schema1
The purpose I ask the question is because I'm getting quotes for a Virtual Machine in Azure Cloud and want to make sure that I'm not going to break the bank on data costs.
The processing of SQL statements on the same database usually takes place entirely on the server and generates little network traffic.
In Oracle, schemas are a logical object. There is no physical barrier between them. In a SQL query using two tables it makes no difference if those tables are in the same schema or in different schemas (other than privilege issues).
Some exceptions:
Real Application Clusters (RAC) - RAC may share a huge amount of data between the nodes. For example, if the table was cached on one node and the processing happened on another, it could send all the table data through the network. (I'm not sure how this works on the cloud though. Normally the inter-node traffic is done with a separate, dedicated network connection.)
Database links - It should be obvious if your application is using database links though.
Oracle Reports and Forms(?) - A few rare tools have client-side PL/SQL processing. Possibly those programs might send data to the client for processing. But I still doubt it would do something crazy like send an entire table to the client to be sorted, and then return the results to the server.
Backups/archive logs - I assume all the data will be backed up. I'm not sure how that's counted, but possibly that means all data written will also be counted as network traffic eventually.
The queries below are examples of different ways to check the network traffic being generated.
--SQL*Net bytes sent for a session.
select *
from gv$sesstat
join v$statname
on gv$sesstat.statistic# = v$statname.statistic#
--You probably also want to filter for a specific INST_ID and SID here.
where lower(display_name) like '%sql*net%';
--SQL*Net bytes sent for the entire system.
select *
from gv$sysstat
where lower(name) like '%sql*net%'
order by value desc;

Logical grouping schemas in ORACLE?

We are planning a new system for a client in ORACLE 11g. I've been mostly in the Sql Server world for several years, and am not really current on the latest ORACLE updates.
One particular feature I'm wondering if ORACLE has added in by this point is some sort of logical "container" for database objects, akin to Sql Server's SCHEMA.
Trying to use ORACLE's schemas like Sql Server winds up being a disaster for code comparisons when trying to push from dev > test > live.
Packages are sort of similar, except that you can't put tables into a package (so they really only work for logical code grouping).
The only other option I am aware of is the archaic practice of having to prefix object names with a "schema" prefix, i.e. RPT_REPORTS, RPT_PARAMETERS, RPT_LOGS, RPT_USERS, RPT_RUN_REPORT(), with the prefix RPT_ denoting that these are all the objects dealing with our reporting engine say. Writing a system like this feels like we never left the 8.3 file-naming age.
Is there by this point in time any cleaner, more direct way of logically grouping related objects together in ORACLE?
Oracle's logical container for database objects IS the schema. I don't know how much "cleaner" and "more direct" you can get! You are going to have to do a paradigm shift here. Don't try to think in SQL Server terms, and force a solution that looks like SQL Server on Oracle. Get familiar with what Oracle does and approach your problems from that perspective. There should be no problem pushing from dev to test to production in Oracle if you know what you're doing.
It seems you have a bit of a chip on your shoulder about Oracle when you use terms like "archaic practice". I would suggest you make friends with Oracle's very rich and powerful feature set by doing some reading, since you're apparently already committed to Oracle for this project. In particular, pick up a copy of "Effective Oracle By Design" by Tom Kyte. Once you've read that, have a look at "Expert Oracle Database Architecture" by the same author for a more in-depth look at how Oracle works. You owe it to your customer to know how to use the tool you've been handed. Who knows? You might even start to like it. Think of it as another tool in your toolchest. You're not married to SQL Server and you're not being unfaithful by using Oracle ;-)
EDIT:
In response to questions by OP:
I'm not sure why that is a logistical problem. They can be thought of as separate databases, but physically they are not. And no, you do not need a separate data file for each schema. A single datafile is often used for all schemas.
If you want a "nice, self-contained database" ala SQL Server, just create one schema to store all your objects. End of problem. You can create other users/schemas, just don't give them the ability to create objects.
There are tools to compare objects and data, as in the PL/SQL Developer compare. Typically in Oracle you want to compare schemas, not entire databases. I'm not sure why it is you want to have multiple schemas each with their own objects anyway. What does is buy you to do that? Keep your objects (tables, triggers, code, views, etc.) in one schema.

load data into text file from oracle database views

I want to load data into text file that is generated after executing "views" in Oracle?How can I achieve this in oracle using UNIX.for example-
I want the same in Oracle on unix box.Please help me out as it alredy cosume lots of time.
your early response is highly appreciated!!
As Thomas asked, we need to know what you are doing with the "flat file". For example, if you're loading it into spreadsheet or doing some other processing that expects a defined format, then you need to use SQL*Plus and spool to a file. If you're looking to save a table (data + table definition) for moving it to another Oracle database then EXP/IMP is the tool to use.
We generally describe the data retrieval process as "selecting" from a table/view, not "executing" a table/view.
If you have access to directories on the database server, and authority to create "Directory" objects in Oracle, then you have lots of options.
For example, you can use the UTL_FILE package (part of the PL/SQL built-ins) to read or write files at the operating system level.
Or use the "external table" functionality to define objects that look like single tables to Oracle but are actually flat files at the OS level. Well documented in the Oracle docs.
Also, for one-time tasks, most of the tools for working SQL and PL/SQL provide facilities for moving data to and from the database. In the Windows environment, Toad's good at that. So is Oracle's free SQLDeveloper, which runs on many platforms. You wouldn't want to use those for a process that runs every day, but they're fine for single moves. I've generally found these easier to use than SQLPlus spooling, but that's a primitive version of the same functionality.
As stated by others, we need to know a bit more about what you're trying to do.

Resources