Sorry the really newbie question but I want to get a simple answer.
We have our data come in to us and it gets stored as raw data. Currently we have an ICT team who have produced an ETL process that does a lot of work turning it into something usable for analysis but this takes a lot of time. The issue is that the time taken means that the data quality report that needs to go back is often late and prevents issues from being resolved. As DQ needs to be done on raw data and not cleaned, can I run build a second ETL process to create the DQ tables.
By this I mean can they both be run simultaneously as the main ETL will be an Oracle process using the raw data as a data source and the one I want to build will also use raw data in Oracle as a data source using SSIS.
I know it's probably a simple question, but as it's not something I do on a day to day basis so I don't know if it can be done this way or if one will have to wait until the other has finished.
We are storing the data in Oracle 11g and doing the major process there. I'll be using SQL 2008 R2 to what I want.
Thanks for your help
Related
i'm trying to populate a database table (access) with the "Microsoft Access Output" step, but I get very bad performance. I mean my data was read from 2 xmls and got merged in 1 minute (36000 rows of data) and the access output is running now for 1 hour with 12 r/s. I set the Commit size high enough to commit all ma data at once (with Commint size 500 I got some java error after the 10th commit).
If I write my file in a csv and import it in access, that is way more quicker, but I would want to automate as much as possible from the process.
Any suggestion is welcome on how to get better performance out of this.
I've never had good luck with the MS-Access output step. It seems to work much better if you create an ODBC entry for your Access DB and load it with a regular Table Output step.
Also you might check that you have the most current version of the Jackcess driver.
Edit: also, for automating flows of data, like writing a CSV and then loading it, you would use a Job. If you're only familiar with transforms, look at the docs on ETL jobs.
I am in the process of creating an Oracle to Vertica process!
We are looking to create a Vertica DB that will run heavy reports. For now is all cool Vertica is fast space use is great and all well and nice until we get to the main part getting the data from Oracle to Vertica.
OK, initial load is ok, dump to csv from Oracle to Vertica, load times are a joke no problem so far everybody things is bad joke or there's some magic stuff going on! well is Simply Fast.
Bad Part Now -> Databases are up and going ORACLE/VERTICA - and I have data getting altered in ORACLE so I need to replicate my data in VERTICA. What now:
From my tests and from what I can understand about Vertica insert, updates are not to used unless maybe max 20 per sec - so real time replication is out of question.
So I was thinking to read the arch log from oracle and ETL -it to create CSV data with the new data, altered data, deleted values-changed data and then applied it into VERTICA but I can not get a list like this:
Because explicit data change in VERTICA leads to slow performance.
So I am looking for some ideas about how I can solve this issue, knowing I cannot:
Alter my ORACLE production structure.
Use ORACLE env resources for filtering the data.
Cannot use insert, update or delete statements in my VERTICA load process.
Things I depend on:
The use of copy command
Data consistency
A max of 60 min window(every 60 min - new/altered data need to go to VERTICA).
I have seen the Continuent data replication, but it seems that nowbody wants to sell their prod, I cannot get in touch with them.
will loading the whole data to a new table
and then replacing them be acceptable?
copy new() ...
-- you can swap tables in one command:
alter table old,new,swap rename to swap,old,new;
truncate new;
Extract data from Oracle(in .csv format) and load it using Vertica COPY command. Write a simple shell script to automate this process.
I used to use Talend(ETL), but it was very slow then moved to the conventional process and it has really worked for me. Currently processing 18M records, my entire process takes less than 2 min.
I have a stored procedure that returns about 50000 records in 10sec using at most 2 cores in SSMS. The SSRS report using the stored procedure was taking 20min and would max out the processor on an 8 core server for the entire time. The report was relatively simple (i.e. no graphs, calculations). The report did not appear to be the issue as I wrote the 50K rows to a temp table and the report could display the data in a few seconds. I tried many different ideas for testing altering the stored procedure each time, but keeping the original code in a separate window to revert back to. After one Alter of the stored procedure, going back to the original code, the report and server utilization started running fast, comparable to the performance of the stored procedure alone. Everything is fine for now, but I am would like to get to the bottom of what caused this in case it happens again. Any ideas?
I'd start with a SQL Profiler trace of both the stored procedure when you execute it normally, and then the same SP when it's called by SSRS. Make sure you include the execution plans involved, so you can see if it's making some bad decisions (though that seems unlikely - the SQL Server should execute an optimal - or at least consistent - plan regardless of the query's source).
We used to have cases where Business Objects would execute stored procs dozens of times for no aparent reason and it lead to occasionally horrible performance, though I've never seen that same behavior with SSRS. It may be somewhere to start, though. You'll also see the execution begin/end times - that will make it clear if it's the database layer that's hanging up, or if the SQL Server hands back the data in 10 seconds and then it's the SSRS service that's choking somewhere.
The primary solution to speeding SSRS reports is to cache the reports. If one does this (either my preloading the cache at 7:30 am for instance) or caches the reports on-hit, one will find massive gains in load speed.
You may also find that monthly restarts of SSRS application domain to resolve your issue.
Please note that I do this daily and professionally and am not simply waxing poetic on SSRS
Caching in SSRS
http://msdn.microsoft.com/en-us/library/ms155927.aspx
Pre-loading the Cache
http://msdn.microsoft.com/en-us/library/ms155876.aspx
If you do not like initial reports taking long and your data is static i.e. a daily general ledger or the like, meaning the data is relatively static over the day, you may increase the cache life-span.
Finally, you may also opt for business managers to instead receive these reports via email subscriptions, which will send them a point in time Excel report which they may find easier and more systematic.
You can also use parameters in SSRS to allow for easy parsing by the user and faster queries. In the query builder type IN(#SSN) under the Filter column that you wish to parameterize, you will then find it created in the parameter folder just above data sources in the upper left of your BIDS GUI.
[If you do not see the data source section in SSRS, hit CTRL+ALT+D.
See a nearly identical question here: Performance Issuses with SSRS
I need every day to transfer large amounts of data (about several millions records) from db2 to oracle database. Could u suggest the best perfoming method to do that?
DB2 will allow you to select Oracle as a replication target. This is probably the most efficient and easiest way to do it every day, it also removes the "intermediate container" objection that you have.
See this introduction (and more from the documentation online) for more.
If you're only talking speed then do this.
Time how long it takes to dump the DB2 data to a flatfile.
Time how long it takes to suck that flatfile into Oracle.
there's your baseline and it's free. If you can beat that with an ETL tool, then decide if the cost of the tool is worth it.
For a simple ETL like this, there's little that I've found that can beat this on time.
The downside of this is just general file manipulation BS...
how do you know when to read from the file
how do you know that you got all the rows
how do you resume when something breaks
All those little "niceties" usually get paid for with speed. Of course, I'm joking a bit. They aren't always a little nicety. They are often essential for a smooth running process.
Dump out data to delimited file. Load to Oracle via DIRECT load sqlldr job. Not sexy, but fast. If you can be on same subnet that would be best (pushing data across the network is not what you want). Set this up on a cron, add email alerts on errors
This is a problem that my friend asked over the phone. The C# 3.5 program he has written is filling a Dataset from a Patient Master table which has 350,000 records. It uses the Microsoft ADO.NET driver for Oracle. The ExecuteQuery method takes over 30 seconds to fill the dataset. However, the same query (fetching about 20K records) takes less than 3 second in Toad . He is not using any Transactions within the program. It has an index on the column (Name) which is being used to search.
These are some alternatives i suggested :-
1) Try to use a Data Reader and then populate a Data table and pass it to the form to bind it to the Combo box (which is not a good idea since it is likely to take same time)
2) Try Oracles' ADO.NET Driver
3) Use Ants profiler to see if you can identify any particular ADO.NET line.
Has anyone faced similar problems and what are some ways of resolving this.
Thanks,
Chak.
You really need to do an extended SQL trace to see where the slowness is coming from. Here is a paper from Cary Millsap (of Method R and formerly of Hotsos) that details doing this:
http://method-r.com/downloads/doc_details/10-for-developers-making-friends-with-the-oracle-database-cary-millsap
Toad would typically only fetch the first x rows (500 in my setup). So double check if the comparison is valid.
Then you should try to seperate the db stuff from the form stuff if possible to see if the db is taking up the time.
If that's the case, try the Oracle libraries if that is any faster, we've seen 50% improvements between the latest Oracle driver and the standard Microsoft driver.
Without knowing the actual code he uses to accomplish his tasks and not knowing the number of rows he's actually fetching (I'm hoping he doesn't read all 350K of them?) it's impossible to say anything that's gonna help him.
Have him add a code snippet to the question for clarity.