DTS vs. SSIS vs. Informatica vs. PL/SQL Scripting

In the past, I have used Informatica for some ETL (Extraction Transformation Loading) but found it rather slow and usually replaced it with some PL/SQL scripts (was using Oracle at the time).
(questions revised based on feedback in answers)
I gather that DTS was Microsoft's ETL tool prior to SSIS.
Would it be difficult to convert an existing application using DTS to SSIS?
Given that SSIS is a Microsoft tool and tightly integrated with SQL Server (virtually a part of it), are there any drawbacks to using it? I don't see any efficiency issues, since I imagine you can do anything ETL-related in SSIS that you could do without it.

I believe SSIS is Microsoft's ETL tool today, replacing DTS.
It's important to remember that ETL performance has as much to do with your schema and how you're doing the transfer as it does with the tool. For example, if you've got indexes in place, the load will run slower than if you do a bulk transfer and create the indexes after it's done. If you do one large batch all at once, you're creating rollback logs that grow in size and slow the process down. It could be that smaller batches will run faster, because the rollback log doesn't have to be as big.
Don't give in to the knee-jerk reaction and blame the tool. Look critically at how you're doing it to make sure that you're not shooting yourself in the foot.
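To make the batching point concrete, here's a minimal sketch of the "commit in small chunks" idea in Python with pyodbc. The DSN, table name and batch size are made up for the example - the point is just that each transaction stays small:

    import pyodbc

    BATCH_SIZE = 10_000  # assumed batch size; tune it for your own system

    def load_in_batches(rows, conn_str="DSN=MyWarehouse"):  # hypothetical DSN
        """Insert rows in small batches so each transaction (and its rollback log) stays small."""
        conn = pyodbc.connect(conn_str)
        cursor = conn.cursor()
        cursor.fast_executemany = True  # send each batch in one round trip

        batch = []
        for row in rows:
            batch.append(row)
            if len(batch) >= BATCH_SIZE:
                cursor.executemany(
                    "INSERT INTO staging_orders (id, amount) VALUES (?, ?)", batch)
                conn.commit()  # committing per batch keeps the rollback log small
                batch = []
        if batch:  # flush the remainder
            cursor.executemany(
                "INSERT INTO staging_orders (id, amount) VALUES (?, ?)", batch)
            conn.commit()
        conn.close()

The same logic applies whatever tool generates the inserts: drop or disable indexes first, load in manageable batches, and rebuild the indexes when the load is done.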

That's correct, DTS was Microsoft's ETL tool prior to SSIS. While I have never seen DTS myself, I believe SSIS is much more user-friendly and GUI-based in comparison to DTS. Speaking of user-friendliness, my first experience with ETL was with Informatica, and I strongly believe that the user-friendliness of Informatica beats SSIS. The industry does recognize Informatica as much more stable and advanced than SSIS.

SSIS has its problems:
It does not work with Excel correctly (because of mixed data types; this is a well-known problem).
It does everything in memory, so you need a lot of memory, especially for sorting large files.
You cannot specify which algorithm to use for sorting. For example, it would be nice to be able to use a merge sort, because it does not require a lot of memory.
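To illustrate why a merge sort is attractive here, below is a rough sketch of an external (disk-backed) merge sort in Python. The chunk size is invented for the example, and this is not how SSIS works internally - it just shows that sorting a large file doesn't have to mean holding it all in memory:

    import heapq
    import itertools
    import tempfile

    CHUNK_LINES = 100_000  # assumed chunk size; only this many lines are ever in memory

    def external_sort(in_path, out_path):
        """Sort a large text file line by line without loading it all into memory."""
        chunk_files = []
        with open(in_path) as src:
            while True:
                chunk = list(itertools.islice(src, CHUNK_LINES))
                if not chunk:
                    break
                chunk.sort()                      # sort one modest chunk in memory
                tmp = tempfile.TemporaryFile("w+")
                tmp.writelines(chunk)
                tmp.seek(0)
                chunk_files.append(tmp)

        # heapq.merge streams the sorted chunks, so memory use stays flat.
        with open(out_path, "w") as dst:
            dst.writelines(heapq.merge(*chunk_files))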

Your information is badly out of date. The current Microsoft ETL tool is SQL Server Integration Services (SSIS).

ETL vs Workflow Management and which to apply? Can they be used the same?

I am in the process of setting up a data pipeline for a client. I've spent a number of years being on the analysis side of things but now I am working with a small shop that only really has a production environment.
The first thing we did was to create a replicated instance of production but I would like to apply a sort of data warehouse mentality to make the analysis portion easier.
My question comes down to: what tool to use, and why? I have been looking at solutions like Talend for ETL but am also very interested in Airflow. The problem is that I'm not quite sure which suits my needs better. I would like to monitor and create jobs easily (I write Python pretty fluently, so Airflow job creation isn't an issue) but also be able to transform the data as it comes in.
Any suggestions are much appreciated
Please consider that the open-source version of Talend (Talend Open Studio) does not provide any monitoring/scheduling capabilities. It is only a code generator. The more sophisticated infrastructure is part of the enterprise editions.
For anyone who sees this: four years later, what we have done is leverage Airflow for scheduling, Fivetran and/or Stitch for extraction and loading, and dbt for transformations.
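The scheduling half of that setup comes down to a fairly small Airflow DAG. Here's a sketch assuming a recent Airflow 2.x release; the DAG name, schedule and dbt project path are placeholders, and the Fivetran/Stitch sync is reduced to a dummy step:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.empty import EmptyOperator

    with DAG(
        dag_id="warehouse_refresh",      # hypothetical DAG name
        start_date=datetime(2023, 1, 1),
        schedule="@daily",               # assumed schedule
        catchup=False,
    ) as dag:
        # In the real setup this step would trigger/await the Fivetran or Stitch sync.
        extract_load = EmptyOperator(task_id="wait_for_extract_load")

        # dbt handles the transformations once the raw data has landed.
        transform = BashOperator(
            task_id="dbt_run",
            bash_command="cd /opt/dbt/warehouse && dbt run",  # placeholder project path
        )

        extract_load >> transform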

ETL Tool Which is most configurable

I am looking for the best-suited ETL tool for the following criteria.
Supports MongoDB
Accepts metadata as input (or accepts a file and builds its metadata on the fly)
Provides configurable mapping (the mapping can be defined outside of development, using some file or table)
Please suggest the tool which caters to the above needs.
Hmm, your question is looking for the most configurable ETL tool. From past years of experience with ETL processes, I can tell you that you will never find a tool that meets all your demands. Especially when you have an enterprise-level data warehouse (needed because of heavy and complex reporting needs), the only software solution is to build your own custom, project-based ETL software, which is often a thankless job.
But (big BUT), you can achieve at least 80% of your needs with existing tools. Plugins, smart usage of scripts, good data-flow design and (if needed) small custom programs paired with a scheduler can help you fulfill the process you have in mind. ETL work is no different from any other work: 80% of the work is done in 20% of the time, and the remaining 20% of the work is done in 80% of the time.
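For example, the configurable-mapping requirement from the question can be covered by a small script that reads the mapping from an external file instead of hard-coding it. Here's a minimal sketch using Python and pymongo; the connection string, collection names and field mapping are all invented for illustration:

    import json
    from pymongo import MongoClient

    def run_mapping(mapping_path, mongo_uri="mongodb://localhost:27017"):  # assumed URI
        """Copy documents from a source to a target collection using an external field mapping."""
        with open(mapping_path) as f:
            cfg = json.load(f)  # e.g. {"source": "raw.orders", "target": "dwh.orders",
                                #       "fields": {"order_id": "id", "total_amount": "amount"}}

        client = MongoClient(mongo_uri)
        src_db, src_coll = cfg["source"].split(".")
        tgt_db, tgt_coll = cfg["target"].split(".")

        docs = []
        for doc in client[src_db][src_coll].find():
            # Rename/select fields according to the mapping file; when the mapping
            # changes, only the JSON file changes, not the code.
            docs.append({new: doc.get(old) for new, old in cfg["fields"].items()})

        if docs:
            client[tgt_db][tgt_coll].insert_many(docs)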
My suggestion for you:
Pentaho Data Integration - free and open source
PDI is a powerful ETL tool and can surely meet your demands. There are plenty of plugins, a solid community and a fine API if you're going to develop more plugins.
Pentaho Data Integration + Integration Server - Enterprise Edition - "cheap enough" for almost every medium-sized project
The Enterprise Edition has everything the free edition has, plus more plugins (a JMS producer, for example), a version control system, Instaview and so on.
Besides that, it has its own server, so scheduling is software-based (not OS-based), and you get logging, better management and, most importantly, support!
Informatica or Microsoft SSIS - expensive and brilliant
I won't waste many words on these tools. Informatica is primarily an ETL-oriented company, and using Informatica at a high level requires a deep understanding of DB/DWH design, ETL processes, PL/SQL, dimensional modeling and so on.
SSIS is primarily built for SQL Server, so I don't see much need for it if either your source database or your target database (the DWH) is not running on SQL Server.
Conclusion
This just scratches the surface of the many tools the market provides; someone else would probably not even mention these ones. Please have a look at one of the published tool lists.
Almost every BI system has its own ETL tool. Maybe a good choice would be to use them together; that way you will be able to get the maximum from both.
Note: a good ETL project manager or ETL developer can stretch a tool's advantages to the level that better/more expensive ones offer!

Is Pentaho ETL and Data Analyzer good choice?

I was looking for an ETL tool and on Google found a lot about Pentaho Kettle.
I also need a data analyzer to run on a star schema so that business users can play around and generate any kind of report or matrix. Again, Pentaho Analyzer is looking good.
The other part of the application will be developed in Java, and the application should be database agnostic.
Is Pentaho good enough, or are there other tools I should check?
Pentaho seems to be pretty solid, offering the whole suite of BI tools, with improved integration reportedly on the way. But... the chances are that companies wanting to go the open source route for their BI solution are also most likely to end up using open source database technology... and in that sense "database agnostic" can easily be a double-edged sword. For instance, you can develop a cube in Microsoft's Analysis Services in the comfortable knowledge that whatever MDX/XMLA your cube sends to the database will be interpreted consistently, holding very little in the way of nasty surprises.
Compare that to the Pentaho stack, which will typically end up interacting with PostgreSQL or MySQL. I can't vouch for how PostgreSQL performs in the OLAP realm, but I do know from experience that MySQL - for all its undoubted strengths - has "issues" with the types of SQL that typically crop up all over the place in an OLAP solution (you can't get far in a cube without using GROUP BY or COUNT DISTINCT). So part of what you save in licence costs will almost certainly be used to solve issues arising from the fact that Pentaho doesn't always know which database it is talking to - robbing Peter to (at least partially) pay Paul, so to speak.
Unfortunately, more info is needed. For example:
will you need to exchange data with well-known apps (Oracle Financials, Remedy, etc)? If so, you can save a ton of time & money with an ETL solution that has support for that interface already built-in.
what database products (and versions) and file types do you need to talk to?
do you need to support querying of web-services?
do you need near real-time trickling of data?
do you need rule-level auditing & counts for accounting for every single row?
do you need delta processing?
what kinds of machines do you need this to run on? linux? windows? mainframe?
what kind of version control, testing and build processes will this tool have to comply with?
what kind of performance & scalability do you need?
do you mind if the database ends up driving the transformations?
do you need this to run in userspace?
do you need to run parts of it on various networks disconnected from the rest? (not uncommon for extract processes)
how many interfaces and of what complexity do you need to support?
You can spend a lot of time deploying and learning an ETL tool - only to discover that it really doesn't meet your needs very well. You're best off taking a couple of hours to figure that out first.
I've used Talend before with some success. You create your transformation by chaining operations together in a graphical designer. There were definitely some WTFs, and it was difficult to deal with multi-line records, but it worked well otherwise.
Talend also generates Java and you can access the ETL processes remotely. The tool is also free, although they provide enterprise training and support.
There are lots of choices. Look at BIRT, Talend and Pentaho, if you want free tools. If you want much more robustness, look at Tableau and BIRT Analytics.

Reporting with DB2

I started learning .NET about 3 years ago. During that time I have gone through a boot camp, learning OO and various data access technologies such as NHibernate, SubSonic and LINQ to SQL.
I didn't want to try EF because it hasn't reached version 3 :)
As far as reporting goes, I have heard that many ORMs fall flat on their faces when it comes to reporting. We have AS/400 (DB2) as our backend. I have heard that LLBLGen does a good job at reporting for this platform, but it is a commercial product and not free. Can someone point me to some good resources for reporting from DB2? Thanks for any links/blog articles.
Reporting on DB2 will work the same as reporting on almost any other database - you can use ODBC, JDBC or native DB2 calls to the database. So, you don't need DB2 reporting references - any database reporting references should meet your needs.
The only things special about DB2 might be a few of the syntax extensions, and how you scale up the back end through parallel database servers (like MapReduce, Teradata, etc.). But neither should be of much concern, since it's extremely ANSI-compliant and the scalability should be largely invisible to the reporting developer.
And Crystal Reports, Brio, Cognos, Business Objects, Microstrategy, Actuate, JasperReports, Birt, etc should all work fine.
ORMs are typically terrible for reporting, since they're object-oriented rather than set-oriented. You'll especially feel the pain with very large data volumes, complex reports or a large number of reports.
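As a trivial illustration of the "any database reporting reference will do" point: via ODBC, a reporting query against DB2 looks the same as against anything else. The DSN, credentials, table and columns below are placeholders:

    import pyodbc

    # "DB2PROD" is an assumed ODBC data source configured for the DB2 / iSeries box.
    conn = pyodbc.connect("DSN=DB2PROD;UID=report_user;PWD=secret")
    cursor = conn.cursor()

    # Plain set-oriented SQL - exactly what most reporting tools generate under the hood.
    cursor.execute("""
        SELECT region, SUM(amount) AS total_sales
        FROM sales
        GROUP BY region
        ORDER BY total_sales DESC
    """)

    for region, total in cursor.fetchall():
        print(f"{region}: {total}")

    conn.close()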
Please, don't overlook the most obvious answer: Query/400!
It is native iSeries software. You configure and run the report on the iSeries itself, and it works great. It is simple, straightforward and maybe a little bit limited, but it gets most of the work done.
Don't be scared of the green screen or the simple interface. It's really a powerful tool that handles the iSeries database very well.
Can someone point me to some good resources for Reporting from DB2?
RPG I!
Light up those indicators!
Query Manager:
You can use SQL (that can take input parameters) to build it, then create a "form" that will provide totals, level breaks, counts, customized headers, titles, etc.
Query/400 does not accept parameters AFAIK.
Free manual at:
http://publib.boulder.ibm.com/infocenter/iseries/v6r1m0/topic/rzatc/sc415212.pdf

Conversion tool for MS-Excel spreadsheets with macros and VB to Oracle?

Our users have created MS-Excel spreadsheets which over time have evolved into fairly complex applications. They run their part of the business with them. But, since the users have never been exposed to software development discipline, these spreadsheets are brittle, single-point-of-failure solutions.
Our development group uses Oracle, primarily with Java and some other technologies. Are there tools for conversion from MS-Excel to Oracle? At least part way, so we can get a head start and not just have to reverse engineer and rewrite?
There are tools that could convert the data, but trying to convert the formulas would leave the design inefficient at best and unusable at worst. The differences between spreadsheets and an Oracle database are similar to the differences between a home gardener and a farmer. Both are useful at their level and some of the same principles apply, but the techniques employed are entirely different.
I suggest you examine the spreadsheets until you understand the goals they are trying to meet and then architect a system in Oracle that meets those goals using the best techniques available in Oracle. The processing will end up being quite different, but the product will be significantly better for it.
You can suck a spreadsheet into MS Access, then push it directly into Oracle as a table (or append it to an existing table). I'm sure you could write an MS Access macro to do it.
It also appears to be possible using SQL Loader from Oracle, but I've never tried that myself.
No, there aren't any tools that will convert the formulas or logic. You will have to do that the hard way. You can get the data into Oracle by exporting it as a CSV and using SQL*Loader to import it into the database.
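If you end up scripting that route rather than using SQL*Loader directly, the data part really is only a few lines. A rough sketch assuming the python-oracledb driver and pandas; the workbook, sheet, table and column names are made up:

    import pandas as pd
    import oracledb

    # Read the users' workbook; the "Orders" sheet and its column names are assumptions.
    df = pd.read_excel("legacy_workbook.xlsx", sheet_name="Orders")

    conn = oracledb.connect(user="etl_user", password="secret", dsn="dbhost/ORCLPDB1")
    cursor = conn.cursor()

    # Load the rows in one batch; for very large sheets SQL*Loader (or chunked commits)
    # would be the better fit.
    cursor.executemany(
        "INSERT INTO orders_stg (order_id, customer, amount) VALUES (:1, :2, :3)",
        df[["order_id", "customer", "amount"]].values.tolist(),
    )
    conn.commit()
    conn.close()

The formulas and business logic are a different story, as the other answers point out - those you have to re-implement by hand.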
You can find tools to migrate the data, but other than that you won't find a tool to do this automatically; you would have to do it all manually.
Even directly copying what was there would not make sense. You were using very limited tools (Excel), so you would need to re-analyze the requirements, and possibly modify them, before doing anything else.
