SSIS vs Pentaho - ETL

Has anyone used both of these and can provide a good comparison? I am doing a school project, so the cost of SSIS isn't an issue; we already have the license for it.
Background on what's going on: I will be downloading about 10 years of patent information into flat files. The result will be 2,080 delimited files. I want a way to load them into MS SQL Server all at once. Then I want to be able to append additional files to the DB as they are released.
Speed of the software doesn't bother me much, as I can just let it run overnight. I am just looking for something with some flexibility and, more importantly, something that is fairly easy to use. I have never done a project like this before and will be learning how to do this from the boards.
THANKS!
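For a sense of what either tool ends up automating here, the core of the job is just a loop that bulk-loads each delimited file and then sets it aside so later releases can be appended. A rough Python sketch using pyodbc, with made-up table, delimiter, and folder names:

```python
# Sketch only: the server, database, table, delimiter, and folders are invented.
import os
import shutil
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=patents;Trusted_Connection=yes;",
    autocommit=True,
)
cursor = conn.cursor()

incoming = r"C:\patents\incoming"
loaded = r"C:\patents\loaded"

for name in sorted(os.listdir(incoming)):
    path = os.path.join(incoming, name)
    # BULK INSERT runs on the server, so the path must be reachable by SQL Server.
    cursor.execute(
        f"BULK INSERT dbo.patent_raw FROM '{path}' "
        "WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\\n', FIRSTROW = 2)"
    )
    # Move the file out of the incoming folder so newly released files can
    # simply be dropped in and appended on the next run.
    shutil.move(path, os.path.join(loaded, name))
```

Both SSIS and PDI essentially give you a graphical, schedulable way to build this loop, plus the column mappings and error handling around it.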

I have used both in real, live projects. I do prefer Pentaho Data Integration (PDI) over SSIS because of its ease of use and flexibility.
Do read a little on the subject before you start using it. There are a couple of excellent books on Kettle (PDI), or you could just read the Get Started guide in PDI's Help menu. The forum is a good place if you are stuck, as is ##pentaho on IRC.
What also helps a lot are the Samples that you can find in the Welcome Screen.
I hope you enjoy it; I know I still do. I have been using it since 2006 and am always pissed when I have to use SSIS on some project :-)
PS: use the jTDS JDBC driver to connect to a SQL Server DB (connection URLs look like jdbc:jtds:sqlserver://host:port/database); it will save you some headaches.
Hope this helps,
Bart

After spending a couple of days developing an ETL package in both PDI and SSIS, I feel confident in saying that PDI is definitely more user-friendly. The user interface alone is much cleaner and seems to flow in a manner that is very intuitive and, as such, easy to use.

Related

ETL vs Workflow Management: which to apply? Can they be used the same way?

I am in the process of setting up a data pipeline for a client. I've spent a number of years on the analysis side of things, but now I am working with a small shop that only really has a production environment.
The first thing we did was to create a replicated instance of production, but I would like to apply a sort of data-warehouse mentality to make the analysis portion easier.
My question comes down to: what tool should I use, and why? I have been looking at solutions like Talend for ETL, but I am also very interested in Airflow. The problem is that I'm not quite sure which suits my needs better. I would like to monitor and create jobs easily (I write Python pretty fluently, so Airflow job creation isn't an issue) but also be able to transform the data as it comes in.
Any suggestions are much appreciated
Please consider that the open-source edition of Talend (Talend Open Studio) does not provide any monitoring or scheduling capabilities. It is only a code generator. The more sophisticated infrastructure is part of the enterprise editions.
For anyone who sees this: four years later, what we have done is leverage Airflow for scheduling, Fivetran and/or Stitch for extraction and loading, and dbt for transformations.
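For a sense of what the scheduling piece of that stack looks like, here is a minimal Airflow DAG that runs dbt after the morning extract/load window; the DAG id, schedule, and project path are invented for illustration:

```python
# Illustrative sketch only: the DAG id, schedule, and paths are made up.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_dbt_refresh",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 6 * * *",  # after the morning extract/load window
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/analytics/dbt_project && dbt run",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/analytics/dbt_project && dbt test",
    )
    dbt_run >> dbt_test
```

In this setup Fivetran/Stitch handle extraction and loading on their own schedules, so Airflow only has to sequence the dbt transformations and tests that depend on the loaded data.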

PostgreSQL config settings on dynamically created EC2 instances

Let me start by saying that I think there is a better way of doing things than I'm doing now... so, please don't post comments and answers saying that I should be using a different technology, etc. I have a "reasonably" specific question.
A little background:
Basically, I have a system where I'm processing a lot of varied, but fairly structured, data feeds each day (CSV files). It's a fairly generic ETL type of system. I started off writing Python scripts to do it all in memory. But I found that I was writing a lot of code to check and enforce rules that could easily be described by a DB schema. So I've got a series of SQS queues (one for each source) that hold the file locations (on S3) to process, and a PostgreSQL DB script that does the loading. Hacky? Yes, probably. But, in a way, it's pretty easy to just define all of your rules in PostgreSQL. At least it is for me, with approximately 15 years of RDBMS experience (what's that old saying about how, when you only have a hammer, everything looks like a nail?).
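Roughly, the shape of each worker is something like this; the queue URL, bucket name, and load script are placeholders rather than the real setup:

```python
# Rough sketch of one worker; the queue URL, bucket, and load script are placeholders.
import subprocess

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/feed-alpha"

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        key = msg["Body"]  # assuming the message body is just the S3 key
        local = "/tmp/" + key.replace("/", "_")
        s3.download_file("my-feed-bucket", key, local)
        # The schema and rules live in the database; psql just drives the load.
        subprocess.check_call(
            ["psql", "-d", "etl", "-v", "infile=" + local, "-f", "/opt/etl/load_feed.sql"]
        )
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```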
So, it all works pretty well. But when creating EC2 instances, I have a choice of an image_id and a type/size. I have my base "PostgreSQL worker image" that I use, but it's really geared toward one size (micro).
But now I'm thinking about playing around to see what kind of gains I could get if I went with small or medium instances. My initial thought is that I would just create separate image_ids with Postgres conf settings geared to each size. But that seems a bit messy (then again, the whole thing is a bit messy and hacky).
Given what I have in place, is there a better way to accomplish this than just separate AMIs?
Final notes:
My AMIs are all PostgreSQL 9.1 and Ubuntu 12.04. And the DBs are just temporary storage. They only exist for the 15 or 20 minutes they are needed to load/process/output the data.
If you feel like this question could be better answered on the SE's DBA site, then please feel free to add a comment. I usually start with StackOverflow because it's a bigger community and it's a community that I feel more at home with. I'm much more of a developer than a DBA.
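One way to avoid a separate AMI per instance size, sketched here purely as an illustration, is to keep the single image and derive the memory-dependent settings at boot (for example from a user-data script); the ratios below are placeholders, not tuning advice:

```python
#!/usr/bin/env python
# Illustration only: run once at boot (e.g. from cloud-init / user data) so one
# AMI can serve micro, small, and medium instances.  Ratios are placeholders.
import subprocess

def total_ram_mb():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) // 1024  # value is in kB
    raise RuntimeError("MemTotal not found in /proc/meminfo")

ram = total_ram_mb()
overrides = {
    "shared_buffers": "%dMB" % (ram // 4),
    "effective_cache_size": "%dMB" % (ram * 3 // 4),
    "maintenance_work_mem": "%dMB" % min(ram // 16, 1024),
}

conf = "/etc/postgresql/9.1/main/postgresql.conf"
with open(conf, "a") as out:  # later settings win, so appending overrides the defaults
    out.write("\n# instance-size overrides written at boot\n")
    for key, value in overrides.items():
        out.write("%s = %s\n" % (key, value))

subprocess.check_call(["service", "postgresql", "restart"])
```

Since the databases only live for 15 or 20 minutes, appending to postgresql.conf once per boot is harmless, and the same AMI then adapts itself to whatever instance size it lands on.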

Are claims of Oracle being hard to administer for simple tasks correct? Aren't there quality admin apps for it? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
I see this claim made in a rant here: http://discuss.joelonsoftware.com/default.asp?joel.3.456646.47, as well as in various other rants that can be found on Google by searching for "oracle sucks". OK, if something as low-key as, say, Drupal doesn't have an easy-to-use visual IDE, I can understand why; but if this is really true about something as big-money as Oracle, why don't we see an entire ecosystem of user-friendly visual tools for basic DBA work on Oracle? I mean, people who work on Oracle work for companies with big budgets, so surely they could afford a license for a fancy "sit tight and enjoy the ride Oracle admin studio" of some sort to help developers do some stuff by themselves without pestering the DBA? Or do these tools really exist and do a good job, whereas the people doing the rants are simply unaware of them?
Quest Software has a variety of tools for database admin, primarily TOAD, but also Spotlight, and there is a backup monitoring tool in beta.
Part of the issue is that Oracle runs on a variety of platforms, such as Solaris, Linux and Windows. The larger (and therefore more complex) installs have been on more exotic hardware. A 'full stack' admin tool would really have to be native to the database platform, and that just hasn't been practical. That's one reason why the OEM stuff is built as a web-app, and why SQL*Plus, the standard client, has stuck as a command line tool. As has RMAN, the backup/recovery manager.
Another issue is that there is a lot of baggage in Oracle. Rather than a simple "Database = File" or "Table = File" model, Oracle needed to cope with data volumes too big for single files. So they have a concept of a tablespace which maps database objects to data files. That's not so much an issue with modern filesystems.
Finally, Oracle is a high-end product. You use it in situations where the cheaper alternatives can't cut it. So it is often applied in more complex environments which would require more admin anyway. In that way, it is more a case that, with Oracle, you can admin your way out of situations which would be impossible for a competitor product.
There are tools for Oracle, both built-in and third-party.
I think that the tools for SQL Server are a lot easier to use, and third-party tools for SQL Server (e.g. Red Gate) are also extremely easy to use and powerful (compared to Toad, which has a byzantine and complex user interface).
Oracle is a multi-platform database, and it dates from the original generation of RDBMS implementations (one of the first that competed to replace older systems), so it has a lot of layers at install time which can be very challenging to deal with. PL/SQL is also more difficult for development compared to SQL Server, MySQL, or DB2 in many ways.
From the point of view of small development shops without dedicated development DBA (or a production DBA who actually understands development) resources, Oracle is less productive than SQL Server or MySQL.
For DBA management and monitoring there's Oracle Enterprise Manager Grid Control. Not an IDE, purely an enterprise-wide administration tool for all of the databases in an organization. Everything from backups to performance monitoring, job creation, alerts, and so forth.
When I was a grasshopper, Master Po told me: 'A fool with a tool is still a fool.' As others have pointed out, Oracle is a high-end product. You really have to read the documentation; once you understand the basic concepts of Oracle, there are a lot of tools available. Almost all tasks are command-line based. A lot of different GUI applications are available to assist you. Oracle's main tools are Enterprise Manager and SQL Developer. Server-side you have a few tools you can use: Database Configuration Assistant, Network Configuration Assistant, Migration Assistant, etc. Choose the one you like for a specific task. The bottom line is: it's not a point-and-click application.
If you're deploying Oracle in a large corporate environment, there is an ecosystem of user-friendly tools to administer the database. But most of those tools are relatively painful to install: they need their own database, for example, and install components on the database server along with the central repository. It makes perfect sense to invest in this sort of heavyweight infrastructure when you're spending 6 or 7 figures on Oracle database licenses and you need to handle things like continuous monitoring and alerting.
On the other hand, most of the folks who are complaining about Oracle usability are trying to install and run Oracle in a much different environment. If you're a developer, for example, who wants to run Oracle on your local laptop so that you have the full stack installed, you're not going to need or want one of these heavyweight tools. Those folks are going to end up with whatever tools Oracle installs by default. Traditionally, those tools have been somewhat less than ideal. Oracle is getting better about that by shipping a lightweight Enterprise Manager web client with the database that is very useful for these types of installs. But it can still be a bit of a fight to ensure that the Enterprise Manager web client works perfectly on a developer's Windows laptop install, which leads a non-trivial number of developers to conclude that "Oracle sucks".
I use an app called PL/SQL developer, and it works pretty well, IMO.
www.enterprise-elements.com is one such tool
You have noticed that you are pointing to a four-year-old rant, right? By a supposed DBA who didn't even know enough to turn off unneeded services in order to shorten the load time?
I'm sorry, but if the complaint is "why can't this industrial-strength DB be managed as easily as this lightweight, feature-poor freeware?", then I think it is a self-answering question.
To answer the rest: yes, there are tools out there. To specifically answer your "I mean, people who work on Oracle work for companies with big budgets, so surely they could afford a license for a fancy 'sit tight and enjoy the ride Oracle admin studio' of some sort to help developers do some stuff by themselves without pestering the DBA?", this is more often a matter of a DBA choosing to lock down privileges, not a function of the database itself. A tool is no use to a developer if their user account is not granted the rights to do what they want.
Rants like that one? Looks like someone tasked with running an app they had no interest in actually learning much about. No wonder they got frustrated. Yes, sometimes Oracle causes frustration of its own, but many of these rants are from people who probably picked a database platform far above their needs, and are disinclined to really learn how to manage it.

Are Pentaho ETL and Data Analyzer a good choice?

I was looking for an ETL tool, and on Google I found a lot about Pentaho Kettle.
I also need a data analyzer to run on a star schema so that business users can play around and generate any kind of report or matrix. Again, Pentaho Analyzer is looking good.
The other part of the application will be developed in Java, and the application should be database-agnostic.
Is Pentaho good enough, or are there other tools I should check?
Pentaho seems to be pretty solid, offering the whole suite of BI tools, with improved integration reportedly on the way. But the chances are that companies wanting to go the open-source route for their BI solution are also most likely to end up using open-source database technology, and in that sense "database agnostic" can easily be a double-edged sword. For instance, you can develop a cube in Microsoft's Analysis Services in the comfortable knowledge that whatever MDX/XMLA your cube sends to the database will be interpreted consistently, holding very little in the way of nasty surprises.
Compare that to the Pentaho stack, which will typically end up interacting with PostgreSQL or MySQL. I can't vouch for how PostgreSQL performs in the OLAP realm, but I do know from experience that MySQL, for all its undoubted strengths, has "issues" with the types of SQL that typically crop up all over the place in an OLAP solution (you can't get far in a cube without using GROUP BY or COUNT DISTINCT). So part of what you save in licence costs will almost certainly be used to solve issues arising from the fact that Pentaho doesn't always know which database it is talking to: robbing Peter to (at least partially) pay Paul, so to speak.
Unfortunately, more info is needed. For example:
will you need to exchange data with well-known apps (Oracle Financials, Remedy, etc)? If so, you can save a ton of time & money with an ETL solution that has support for that interface already built-in.
what database products (and versions) and file types do you need to talk to?
do you need to support querying of web-services?
do you need near real-time trickling of data?
do you need rule-level auditing and counts to account for every single row?
do you need delta processing?
what kinds of machines do you need this to run on? linux? windows? mainframe?
what kind of version control, testing and build processes will this tool have to comply with?
what kind of performance & scalability do you need?
do you mind if the database ends up driving the transformations?
do you need this to run in userspace?
do you need to run parts of it on various networks disconnected from the rest? (not uncommon for extract processes)
how many interfaces and of what complexity do you need to support?
You can spend a lot of time deploying and learning an ETL tool - only to discover that it really doesn't meet your needs very well. You're best off taking a couple of hours to figure that out first.
I've used Talend before with some success. You create your transformation by chaining operations together in a graphical designer. There were definitely some WTFs, and it was difficult to deal with multi-line records, but it worked well otherwise.
Talend also generates Java and you can access the ETL processes remotely. The tool is also free, although they provide enterprise training and support.
There are lots of choices. Look at BIRT, Talend and Pentaho, if you want free tools. If you want much more robustness, look at Tableau and BIRT Analytics.

What is the most challenging development environment you've ever had to work in and what did you do to get around the limitations?

By 'challenging development environment' I don't mean you're on a small boat that's rocking up and down and someone is holding a gun to your head. I mean, are the tools at your disposal making the problem difficult?
Development is typically a cycle of code, run, observe the result, repeat. In some environments this is a very quick and painless process, but in others it's very difficult. We end up using little tricks to help us observe the result and run the code faster.
I was thinking of this because I just started using SSIS (an ETL tool included with SQL Server 2005/8). It's been quite challenging for me to make progress, mainly because there's no guidance on what all the dialogs mean and also because the errors are very cryptic and most of the time don't really tell you what the problem is.
I think the easiest environment I've had to work in was VB6, because there you can edit code while the application is running and it will continue running with your new code! You don't even have to run it again. This can save you a lot of time. Surprisingly, you can do the same in NetBeans with Java code: it steps out of the method and re-runs the method with the new code.
In SQL Server 2000, when there is an error in a trigger, you get no stack trace, which can make it really tricky to locate where the problem occurred, since an insert can have a cascading effect and fire many triggers. In Oracle you get a very nice little stack trace with line numbers, so resolving the problem is very easy.
Some of the things that I see really help in locating problems:
Good error messages when things go wrong.
Providing a stack trace when a problem occurs.
Debug environment where you can pause, then output the value of variables and step to follow the execution path.
A graphical debug environment that shows the code as it's running.
A screen that can show the current values of variables, so you don't have to print them.
Ability to turn on debug logging on a production system (a tiny sketch of what that can look like follows this list).
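As a tiny sketch of that last point (the logger name and environment variable are invented), debug logging that can be switched on in production without a redeploy might look like this:

```python
# Sketch: set APP_DEBUG=1 in the environment to get debug-level output
# in production without changing or redeploying the code.
import logging
import os

logging.basicConfig(
    level=logging.DEBUG if os.environ.get("APP_DEBUG") == "1" else logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("etl.loader")

def load_file(path):
    log.debug("starting load of %s", path)  # only visible when APP_DEBUG=1
    # ... do the actual work ...
    log.info("loaded %s", path)

if __name__ == "__main__":
    load_file("example.csv")
```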
What is the worst you've seen and what can be done to get around these limitations?
EDIT
I really didn't intend for this to be flame bait. I'm really just looking for ideas to improve systems so that if I'm creating something I'll think about these things and not contribute to people's problems. I'm also looking for creative ways around these limitations that I can use if I find myself in this position.
I was working on making modifications to Magento for a client. There is very little information on how the Magento system is organized. There are hundreds of folders and files, and there are at least a thousand view files. There was little support available from Magento forums, and I suspect the main reason for this lack of information is because the creators of Magento want you to pay them to become a certified Magento developer. Also, at that time last year there was no StackOverflow :)
My first task was to figure out how the database schema worked and which table stored some attributes I was looking for. There are over 300 tables in Magento, and I couldn't find out how the SQL queries were being done. So I had just one option...
I exported the entire database (300+ tables, and at least 20,000 lines of SQL) into a .sql file using phpMyAdmin, and I 'committed' this file into the Subversion repository. Then I made some changes to the database using the Magento administration panel and redownloaded the .sql file. Then I ran a diff using TortoiseSVN and scrolled through the 20k+ line file to find which lines had changed, LOL. As crazy as it sounds, it did work, and I was able to figure out which tables I needed to access.
My second problem was that, because of the crazy directory structure, I had to FTP to about 3 folders at the same time for trivial changes. So I had to keep 3 windows of my FTP program open, switch between them, and FTP each time.
The third problem was figuring out how the URL mapping worked and where some of the code I wanted was stored. Here, by sheer luck, I managed to find the Model class I was looking for.
Mostly by sheer luck and other similar crazy adventures, I managed to work my way through and complete the project. Since then, Stack Overflow has started, and thanks to a helpful answer to this bounty question I was finally able to get enough information about Magento that I can do future projects in a less crazy manner (hopefully).
Try keypunching your card deck in Fortran, complete with IBM JCL (Job Control Language), handing it in at the data center window, coming back the next morning and getting an inch-thick stack of printer paper with the hex dump of your crash, and a list of the charges to your account.
Grows hair on your fingernails.
I guess that was an improvement on the prior method of sitting at the console, toggling switches and reading the lights.
Occam on a 400x transputer network. As there was only one transputer that could output to the console, debugging was a nightmare. Had to build a test harness on a Sun network.
I took a class once that was loosely based on SICP, except it was taught in Dylan rather than Scheme. Actually, it was in the old Dylan syntax, the prefix one that was based on Scheme. But because there were no interpreters for that old version of Dylan, the professor wrote one. In Java. As an applet. Which meant that it had no access to the filesystem; you had to write all of your code in a separate text editor, and then paste it into the Dylan interpreter. Oh, and it had no debugging facilities, of course. And being a Dylan interpreter written in Java, and this being back in 2000, it was ungodly slow.
Print statement debugging, lots of copying and pasting, and an awful lot of cursing at the interpreter were involved.
Back in the '90s, I was developing applications in Clipper, a compilable dBase-like language. I don't remember if it came with a debugger; we often used a third-party debugger called 'Mr Debug' (really!). Although Clipper was fast, some of our more intensive routines were written in C. If you prayed to the correct gods and uttered the necessary incantations, you could use Microsoft's CodeView debugger to debug the C code. But usually not for more than a few minutes, as the program usually didn't like to spend much time running under CodeView (usually memory problems).
I had a series of makefile switches that I used to stub out the sections of code that I didn't need to debug at the time. My debugging environment was very sparse so there was as much free memory as possible for the program. I also think I drank a lot more back then...
Some years ago I reverse engineered game copy protections. Because the protections were written in C or C++, they were fairly easy to disassemble and understand. But in some cases it got hairy when the copy protection took a detour into the kernel to obfuscate what was happening. A few of them also started to use custom-made virtual machines to make the problem less understandable. I spent hours writing hooks and debuggers to be able to trace into them. The environment really demanded a competitive and innovative mind. I had everything at my disposal save time. Mistakes caused reboots and gave very little feedback about what went wrong. I realized thinking before acting is often a better solution.
Today I despise debuggers. If the problem is in code visible to me, I find it easiest to use verbose logging. (Sometimes the error is in not understanding the interface/environment; then debuggers are good.) I have also realized that time is of the essence. You need to have a good working environment with the possibility to test your code instantly. If your compiler takes 15 seconds, your environment takes 20 seconds to update, or your caches take 5 minutes to clear, find another way to test your code. Progress keeps me motivated, and without a good working environment I get bored, angry, and frustrated.
At the last job I had, I was a Sitecore developer. Bug fixing can be very painful when the bug only occurs on the client's system, they do not have Visual Studio installed on that system, remote debugging is off, and the problem only happens on the production server (not the staging server).
The worst in recent memory was developing SSRS reports using Dundas controls. We were doing quite a bit with the grids, which required coding. The pain was the bugginess of the controls and the lack of debugging support.
I never got around the limitations, but just worked through them.

Resources