I want a better overall way to copy all the tables and their data from a production database schema into a dev database schema that lives in a different database on a different subnet, using bash for the unloads and loads.
It is important that the schema name on the dev database can be, and usually is, different.
The table structure in both schemas is the same; only the database name, schema name and data differ.
It is important that the solution requires minimal manual intervention. Copying files across manually is acceptable, but editing file contents to change data is not, unless that editing can be scripted to happen automatically.
Currently we run a very long series of scripted per-table exports to IXF with LOBs, followed by an equally long series of carefully sequenced scripted loads, taking care to load the data in order, parents before children.
Unload example:
export to CLIENT.ixf of ixf lobs to $LOCATION lobfile CLIENT_lobs modified by lobsinfile select * from CLIENT;
Load example:
load from CLIENT.ixf of ixf lobs from $LOCATION modified by lobsinfile replace into CLIENT statistics no copy no indexing mode autoselect allow no access check pending cascade deferred;
I have looked at db2move, but I cannot find a way to specify the database and schema name during the load; it appears to be supported only in the unload/export.
db2look looks promising, but does it export the data too, or just the table definitions?
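One way to reduce the manual scripting, sketched here on the assumption that the db2 command line processor is available on both servers; the database name, schema names and paths are placeholders that mirror the examples above:

#!/bin/bash
# Hedged sketch: generate a per-table export script on production and a matching
# load script for the dev schema, driven by SYSCAT.TABLES so nothing is hand-edited.
SRC_DB=PRODDB
SRC_SCHEMA=PRODSCH
TGT_SCHEMA=DEVSCH
LOCATION=/data/unload

db2 connect to "$SRC_DB" > /dev/null

# Note: this lists tables alphabetically; parent-before-child load order (or a
# SET INTEGRITY pass after the loads) still has to be handled separately.
for T in $(db2 -x "select rtrim(tabname) from syscat.tables where tabschema = '$SRC_SCHEMA' and type = 'T'"); do
  echo "export to $LOCATION/$T.ixf of ixf lobs to $LOCATION lobfile ${T}_lobs modified by lobsinfile select * from $SRC_SCHEMA.$T;" >> export_all.clp
  echo "load from $LOCATION/$T.ixf of ixf lobs from $LOCATION modified by lobsinfile replace into $TGT_SCHEMA.$T statistics no copy no indexing mode autoselect allow no access check pending cascade deferred;" >> load_all.clp
done

db2 terminate > /dev/null
# Run the generated scripts with db2 -tvf export_all.clp on production and
# db2 -tvf load_all.clp on dev (after copying the .ixf and LOB files across).

Because the load script is generated rather than hand-written, pointing it at a different target schema is just a matter of changing TGT_SCHEMA.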
It's a real-world problem and I believe a solution exists, but I couldn't find one.
We have a database called Transactions that contains tables such as Positions, Securities, Bogies, Accounts, Commodities and so on, which are updated continuously, every second, whenever a new transaction happens. For the time being, we have replicated the master database Transactions to a new database named TRN, on which we do all the querying and updating.
We want a sort of monitoring system for the database (like the htop process viewer in Linux) that dynamically lists the rows being updated in its tables at any given time.
TL;DR: Is there any way to get a continuously updating list of changed rows in any table in the database?
Currently we are working with Sybase and Oracle DBMSs on Linux (Ubuntu), but we would like generic answers that cover most platforms and DBMSs (including MySQL), as well as any tools, utilities or scripts that can do this, so that in future we can easily migrate to other platforms and/or DBMSs.
To list updated rows, you conceptually need one of two things:
The updating statement's effect on the table.
A previous version of the table to compare with.
How you get them and in what form is completely up to you.
The first option lets you list updates with statement-level granularity, while the second is more suitable for time-based granularity.
Some options off the top of my head:
Write to a temporary table
Add a field with transaction id/timestamp
Make clones of the table regularly and compare them (sketched below)
AFAICS, Oracle doesn't have built-in facilities to get the affected rows, only their count.
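To make the "clone and compare" option above concrete, here is a minimal bash sketch, assuming Oracle with sqlplus on the path; the connection string and table name are placeholders:

#!/bin/bash
# Hedged sketch: snapshot the table every few seconds and diff against the previous
# snapshot to see which rows appeared, disappeared or changed.
CONN='trn_user/secret@TRN'
TABLE=POSITIONS

while true; do
  sqlplus -s "$CONN" > snapshot.new <<SQL
set pagesize 0 linesize 4000 feedback off heading off
select * from $TABLE order by 1;
SQL
  if [ -f snapshot.old ]; then
    diff snapshot.old snapshot.new | grep '^[<>]'   # rows changed since the last poll
  fi
  mv snapshot.new snapshot.old
  sleep 5
done

This only scales to modest table sizes; for the transaction volumes described in the question, a transaction-id/timestamp column or a replication-based feed is a better fit.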
There are not a lot of details in the question, so I'm not sure how much of this will be of use ...
'Sybase' is mentioned but nothing is said about which Sybase RDBMS product (ASE? SQLAnywhere? IQ? Advantage?)
by 'replicated master database transaction' I'm assuming this means the primary database is being replicated (as opposed to the database called 'master' in a Sybase ASE instance)
no mention is made of what products/tools are being used to 'replicate' the transactions to the 'new database' named 'TRN'
So, assuming part of your environment includes Sybase(SAP) ASE ...
MDA tables can be used to capture counters of DML operations (eg, insert/update/delete) over a given time period
MDA tables can capture some SQL text, though the volume/quality could be in doubt if a) MDA is not configured properly and/or b) the DML operations are wrapped up in prepared statements, stored procs and triggers
auditing could be enabled to capture some commands but again, volume/quality could be in doubt based on how the DML commands are executed
also keep in mind that there's a performance hit for using MDA tables and/or auditing, with the level of performance degradation based on individual config settings and the volume of DML activity
Assuming you're using the Sybase(SAP) Replication Server product, those replicated transactions sent through repserver likely have all the info you need to know which tables/rows are being affected; so you have a couple options:
route a copy of the transactions to another database where you can capture the transactions in whatever format you need [you'll need to design the database and/or any customized repserver function strings]
consider using the Sybase(SAP) Real Time Data Streaming product (yeah, additional li$ence is required) which is specifically designed for scenarios like yours, ie, pull transactions off the repserver queues and format for use in downstream systems (eg, tibco/mqs, custom apps)
I'm not aware of any 'generic' products that work, out of the box, as per your (limited) requirements. You're likely looking at some different solutions and/or customized code to cover your particular situation.
In Hive, you can create two kinds of tables: Managed and External
In case of managed table, you own the data and hence when you drop the table the data is deleted.
In case of external table, you don't have ownership of the data and hence when you delete such a table, the underlying data is not deleted. Only metadata is deleted.
Now, I have recently observed that you cannot create an external table over a location on which you don't have write (modification) permissions in HDFS. I completely fail to understand this.
Use case: it is quite common that the data you are churning is huge and read-only. So, to churn such data via Hive, do you have to copy this huge data set to a location on which you have write permissions?
Please help.
Though it is true that dropping an external table does not result in dropping the data, this does not mean that external tables are for reading only. For instance, you should be able to do an INSERT OVERWRITE on an external table.
That being said, it is definitely possible to use (internal) tables when you only have read access, so I suspect this is the case for external tables as well. Try creating the table with an account that has write access, and then using it with your regular account.
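For reference, a minimal sketch of the external-table DDL being discussed, run through the hive CLI; the table name, column and HDFS path are made up:

# Hedged sketch: point an external table at an existing HDFS directory. Whether this
# works without write permission on the path depends on the Hive version and its
# authorization setup, which is exactly the issue raised above.
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
  log_line STRING
)
ROW FORMAT DELIMITED
LOCATION '/data/shared/weblogs';"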
I wanted some advice on how to deal with table operations (renaming a column) in Google BigQuery.
Currently, I have a wrapper to do this. My tables are partitioned by date, e.g. if I have a table named fact, I will have several tables named:
fact_20160301
fact_20160302
fact_20160303... etc
My rename-column wrapper generates aliased queries, i.e. if I want to change my table schema from
['address', 'name', 'city'] -> ['location', 'firstname', 'town']
I do a batch query operation:
select address as location, name as firstname, city as town
and do a WRITE_TRUNCATE on the parent tables.
My main issue lies with the fact that BigQuery only supports 50 concurrent jobs. This means that when I submit my batch request, I can only do around 30 partitions at a time, since I'd like to reserve 20 slots for ETL jobs that are running.
Also, I haven't found a way to do a poll_job on a batch operation to see whether or not all the jobs in a batch have completed.
If anyone has some tips or tricks, I'd love to hear them.
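For concreteness, here is a hedged bash sketch of the kind of batched rewrite described above, using the bq CLI (flag names are from memory, so check bq query --help); the shell itself caps the concurrency and waits for completion:

#!/bin/bash
# Hedged sketch: rewrite each date-sharded fact_ table in place with the renamed
# columns, never running more than MAX_JOBS bq invocations at once.
DATASET=mydataset
MAX_JOBS=30

for TABLE in $(bq ls -n 10000 "$DATASET" | awk '$1 ~ /^fact_/ {print $1}'); do
  bq query --batch --nouse_legacy_sql \
    --destination_table="${DATASET}.${TABLE}" --replace \
    "select address as location, name as firstname, city as town from \`${DATASET}.${TABLE}\`" &
  # throttle: pause while the number of running background jobs is at the cap
  while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do sleep 5; done
done
wait   # returns only once every remaining rewrite has finished, so no per-job polling is needed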
I can propose two options:
Using View
Creating views is very simple to script out and execute; it is fast and free compared with the cost of scanning the whole table in the select-into approach.
You can create a view using the Tables: insert API with the type property set appropriately.
Using Jobs: insert EXTRACT and then LOAD
Here you can extract the table to GCS and then load it back into BigQuery with the adjusted schema.
The above approach will a) eliminate the cost of querying (scanning) the tables and b) can help with the job limits, but it might not; that depends on the actual volume of the tables and other requirements you might have.
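For the view option, a minimal sketch using the bq CLI rather than the raw Tables: insert API (again, flag names are from memory; the dataset and table names are placeholders):

# Hedged sketch: expose one shard through a view that already carries the new
# column names, instead of rewriting the underlying table.
bq mk --nouse_legacy_sql \
  --view "select address as location, name as firstname, city as town from \`mydataset.fact_20160301\`" \
  mydataset.fact_20160301_v

Creating the view touches only metadata, so it stays clear of the query-job concurrency limit; the trade-off is that downstream readers have to query the view instead of the original table name.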
The best way to manipulate a schema is through the Google BigQuery API.
Use the tables.get API to retrieve the existing schema for your table: https://cloud.google.com/bigquery/docs/reference/v2/tables/get
Manipulate your schema file, renaming columns etc.
Again using the API, perform an update on the schema, setting it to your newly modified version. This should all occur in one job: https://cloud.google.com/bigquery/docs/reference/v2/tables/update
For a project we need to investigate an existing installation of IBM DataStage that does a whole lot of ETL across loads of jobs.
The job flow diagrams contain lots of tables being used as a source (in both MSSQL and Oracle), as well as a target (mostly in Oracle).
My question now is:
How can I find all database tables used by all jobs in a certain DataStage project?
I looked in Tools - Advanced Find, and there I can see all "table definitions". BUT most of the tables actually used in jobs do not show up there, as they are accessed from what DataStage calls "Parallel Jobs", which are in effect SQL queries against database tables.
I am particularly interested in locating TARGET tables which are being loaded by a job.
So to put it bluntly, I want to be able to answer the question "Which job loads table XY?".
If that is not possible, an automated means of extracting all the SQL statements used by the jobs would be an alternative.
We have access to IBM WebSphere DataStage and QualityStage Designer 8.1.
Exporting the jobs creates a text file that details what each job does. Open the export file in a text editor and you should be able to find the SQL inserts with a simple search. Start by searching for SQL keywords like 'INTO' and 'FROM'.
Edit: alternatively, if every table that was used was defined by importing table definitions, you should be able to find the table definition in the folder for its type. This, however, will not make it apparent where and how the table was used (which job? insert or select from?), so I would recommend the first method of searching the export files.
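A hedged sketch of that search, assuming the project has been exported from Designer to a .dsx file (the file name is a placeholder):

# Pull out anything that looks like SQL from the export so you can see which jobs
# reference which tables.
grep -n -i -E 'insert into|delete from|update .+ set|select .+ from' project_export.dsx > job_sql.txt

# To answer "which job loads table XY?", search for the table name itself and then
# scroll up in the export to the enclosing job definition.
grep -n -i 'XY' project_export.dsx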
I'm writing an online tax return filing application using MVC3 and EF 4.1. Part of the application requires that the taxpayer be able to upload documents associated with their return. The users will be able to come back days or weeks later and possibly upload additional documents. Prior to finally submitting their return the user is able to view a list of files that have been uploaded. I've written the application to save the uploaded files to a directory defined in the web.config. When I display the review page to the user I loop through the files in the directory and display it as a list.
I'm now thinking that I should be saving the files to the SQL Server database as binary data, in addition to saving them to the directory. I'm trying to avoid "what if" scenarios.
What if
A staff member accidentally deletes a file from the directory.
The file server crashes (Other agencies use the same SAN as us)
A staff member saves other files to the same directory. The taxpayer should not see those
Any other scenario that causes us to have to request another copy of a file from a taxpayer (Failure is not an option)
I'm concerned that saving to the SQL Server database will have dire consequences that I am not aware of since I've not done this before in a production environment.
There's a really good paper by Microsoft Research called To Blob or Not To Blob.
Their conclusion after a large number of performance tests and analysis is this:
if your pictures or documents are typically below 256 KB in size, storing them in a database VARBINARY column is more efficient
if your pictures or documents are typically over 1 MB in size, storing them in the filesystem is more efficient (and with SQL Server 2008's FILESTREAM attribute, they're still under transactional control and part of the database)
in between those two, it's a bit of a toss-up depending on your use
If you decide to put your pictures into a SQL Server table, I would strongly recommend using a separate table for storing those pictures - do not store the employee photo in the employee table - keep them in a separate table. That way, the Employee table can stay lean and mean and very efficient, assuming you don't always need to select the employee photo, too, as part of your queries.
For filegroups, check out Files and Filegroup Architecture for an intro. Basically, you would either create your database with a separate filegroup for large data structures right from the beginning, or add an additional filegroup later. Let's call it "LARGE_DATA".
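Adding such a filegroup is a one-off piece of T-SQL; here is a hedged sketch run through sqlcmd (the server name, database name and file path are placeholders):

# Hedged sketch: create the LARGE_DATA filegroup and give it one data file.
sqlcmd -S myserver -Q "
ALTER DATABASE MyTaxDb ADD FILEGROUP LARGE_DATA;
ALTER DATABASE MyTaxDb ADD FILE
    (NAME = 'MyTaxDb_large', FILENAME = 'D:\SQLData\MyTaxDb_large.ndf', SIZE = 100MB)
    TO FILEGROUP LARGE_DATA;"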
Now, whenever you have a new table to create which needs to store VARCHAR(MAX) or VARBINARY(MAX) columns, you can specify this file group for the large data:
CREATE TABLE dbo.YourTable
(
    DocumentID INT NOT NULL PRIMARY KEY,    -- ....... define your regular fields here ......
    DocumentData VARBINARY(MAX) NULL        -- an illustrative large-object column
)
ON Data                   -- the basic "Data" filegroup for the regular data
TEXTIMAGE_ON LARGE_DATA;  -- the filegroup for large chunks of data
Check out the MSDN intro on filegroups, and play around with it!