Is the Pentaho ETL tool dying?

I'm having trouble finding useful information about the Pentaho ETL tool. Is this tool dying?
What are the alternative tools/platforms?

In short, yes. Hitachi Vantara acquired the suite and isn't giving it much love. They've just released version 9.2, but I don't have much faith that the CE version will see many improvements; maybe the paid version will get a few more, but I don't expect much there either. They killed the old forum, and the new community they created is deserted.
For the Data Integration (ETL) part of the suite, you can go to apache-hop (https://hop.apache.org). A group of Kettle developers (Kettle was the original name of the Pentaho Data Integration tool in the Pentaho Suite) are actively modernizing and much improving the old code: it mostly works with Java 11, although they are staying on Java 8 because of the Beam plugins' dependencies, and the dependencies on old, deprecated libraries are gone. For now there isn't a lot of new functionality, mostly the migration plus some new features, but even though there's no 1.0 release yet it is well advanced, and some PDI/Kettle users are already beginning to transition their production environments to the new tool.
There's a migration utility that converts the old PDI jobs and transformations into Hop workflows and pipelines. After applying the migration tool you'll still need to check and adjust things: DB connections will probably need some work afterwards, and a few of the old steps aren't available in Hop (the Formula step is the one that affects me most), but in general the utility saves a lot of work.
New things in Hop:
Built with project and environment configuration in mind, so paths and project- or machine-dependent information work across different configurations; you just switch the project or environment and everything works.
Much better metadata injection support. PDI/Kettle still had a lot of steps with properties not available for metadata injection, and during the migration they have added it.
Night mode
Much lighter and quicker to start (PDI takes a long time just to initialize), and you can remove the steps/plugins you don't need if you just want a thin client to perform one task.
Hop-web

Related

Why do we need complexity in dependency management?

I am not sure if the title of the question is correct, but please read the question.
I have been working in C/C++ for most of my working life (close to 11 years). We only had C/C++ source/header files, and all dependencies were managed by Makefiles. Things were simple and manageable.
For the last 1.5 years I have shifted to the Java domain, and I find it extremely irritating that the most difficult aspect of working with anything new is the dependency manager: Maven, Leiningen, Buildr, sbt, etc.
Whenever I download anything new from the open source world, a significant amount of time has to be spent just setting up the compilation, build, and run environment, even when I am using Eclipse. Why can't all the dependencies be shipped along with the software to be downloaded? Why must tools like Maven and Leiningen make a separate internet connection to download the dependencies? I know that Maven keeps a local repository and should be able to find a dependency locally, since it downloads the whole internet anyway, but why is this model used? I am behind a firewall and not everything is accessible, so the tools fail to download dependencies. I am sure the same situation exists in most work environments.
Recently I started with Clojure, and boy, it has been a pain to get Eclipse configured for Clojure. Leiningen is supposed to be some kind of magic that must be used with any Clojure development. Sometimes it feels like learning Leiningen is more important than learning the concepts of Clojure. I downloaded the so-called 'standalone' jar file for Leiningen because 'self-install' was not working for me, but it fooled me: as soon as I run the 'lein' command it makes an internet connection and tries to download something. Why? It won't even print the help menu without connecting to the internet. Why? There is no way I can fulfil its demands without bypassing my firewall, because I don't know, and no one can tell me, everything this thing wants. There is simply no other way.
And everyone seems to be inventing their own. Java had Ant, which was simple, and moved to Maven; some projects use the Ruby-based Buildr; Clojure has Leiningen; Scala has sbt; Go has something else. Why? Why do we need this added complexity in a world already full of complexity? Why can't there be just one tool?
All you experts in Java technology, please excuse my rant. I am sure this question will be downvoted and closed as coming from someone who is not trying hard enough to understand things. But please believe me, I have spent enough hours battling this unnecessary complexity.
I just want to know how others get around this, or whether I am the only unfortunate one facing these issues.
I guess this question cannot really accept a single answer. I can humbly provide you with some elements; hopefully they will help you get some perspective on the problem.
There are mainly two problems I see with Java build systems:
some of them are declarative while others use scripts
the fragmentation of the Java build tools is tied to the people and to the stewardship of the space, not so much to the technological choices.
Maven is the epitome of defining your build with a formal grammar, in a standard manner. Your pom.xml file contains a lot more than just your build: it is the identity of your artifacts, the project metadata, the modules, and the plugins brought in. It pays particular attention to the declaration of dependencies and repositories.
Maven is declarative.
For a certain population of programmers this is great, and they don't create new projects very often. It works well over time and consolidates the build nicely.
Ant is a different system where you define tasks that will execute, chained in a particular order. All the definitions are made using XML and in effect, you are writing scripts and declaring how they will be stitched together.
Buildr (full disclosure: I am a committer there) is a build system that was created out of frustration with the inefficiency of the declarative approach for builds that need additional steps and complex testing, and with the rigidity of using XML for a build. It is script-based, favouring convention over configuration (providing good defaults, but letting you take over when you need to change things).
I am not familiar with Gradle and sbt, but from what I have heard I think they extend and build on this approach.
I hope this gives you a better picture of the landscape of build tools.
The reason no standard build tool emerged is probably tied to the fact that Sun didn't push one with Java. Eventually, I think, they adopted Ant (most of the JSR jars I have seen were built with it). There have also been products built in this space by extending some of those build systems; there is always going to be a huge difference between people being paid to maintain code and people doing it on the side.
And well, people argue. Build systems are a great way to start a flame war. We have a hard time agreeing on a standard though some of the common elements are now settling well around the Maven artifacts.
As for the need to download the Internet over and over again, it's a rather long story but here are a few things that may trigger the need for an unnecessary download:
any of the dependencies using SNAPSHOT will try to get the latest snapshot. This is a great scheme but it takes its toll. You might depend on something that depends on a snapshot, and get a download because of that.
Maven doesn't re-download artifacts it already has, but it sometimes checks checksums or looks for updates. This is easy to avoid: just run in offline mode with the -o (--offline) option on the command line.
Tools like Buildr were built to fix this issue once and for all. First, you only download what you said you would. Second, no connection is made again unless you ask for it. By default, Buildr doesn't play the transitive dependencies game; you can ask for it, but you have to do so explicitly.
I hope this was informative and that your journey in Java land becomes less painful going forward.

Automated Software Versioning integrated with Issue Control System

I decided to use the following pattern after reading about semantic versioning at http://semver.org/. However, I have some unresolved questions in my mind about automating and integrating the SDLC tools.
Version Pattern:
major.minor.revision.build
Such that:
Major: major changes; should be incremented manually.
Minor: minor changes; should be incremented automatically whenever a new feature, or an enhancement to an existing feature, is resolved in the issue tracking system.
Revision: changes not affecting the minor version; should be incremented automatically whenever a bug is resolved in the issue tracking system (a sketch of these increment rules follows below).
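To make the increment rules concrete, here is a minimal, hypothetical Java sketch of how a build step could derive the next version from counts of resolved issues. The class, the counts, and the hard-coded example are illustrative only and not part of any existing Maven or Bamboo plugin:

    // Hypothetical illustration of the major.minor.revision.build pattern described above.
    // Nothing here is a real Maven/Bamboo plugin API; it only shows the increment rules.
    public final class VersionBumper {

        public static String nextVersion(String current,
                                         int featuresResolved,
                                         int bugsResolved,
                                         int buildNumber) {
            String[] parts = current.split("\\.");
            int major = Integer.parseInt(parts[0]);                     // bumped manually, never here
            int minor = Integer.parseInt(parts[1]) + featuresResolved;  // one bump per resolved feature/enhancement
            int revision = Integer.parseInt(parts[2]) + bugsResolved;   // one bump per resolved bug
            return major + "." + minor + "." + revision + "." + buildNumber;
        }

        public static void main(String[] args) {
            // e.g. 2 features and 3 bugs resolved since version 1.4.7, CI build #128
            System.out.println(nextVersion("1.4.7", 2, 3, 128)); // prints 1.6.10.128
        }
    }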
Assume that developers never commit source unless an issue has been resolved in the issue tracking system, and that the issue tracking system is JIRA in this configuration. This means there are bugs, improvements, and new features as issue types by default, apart from tasks.
Furthermore, I am adding a continuous integration tool to this configuration; assume it is Bamboo (by the way, I have never used Bamboo before, only Hudson). I am using the Eclipse IDE with the Mylyn plugin, and the project is a Maven (web) project.
Now, I want to clarify what I want to do by illustrating the following scenario. An analyst (A) opens an issue (I), which is a new feature, related to a Maven project (P). As a developer (D), I receive an email about the issue, and I open the task via the Mylyn interface in Eclipse. I understand and develop the new feature related to issue (I). Since I am a Test Driven Development oriented developer, I write the unit, DBUnit, and user-acceptance tests (for example using Selenium) accordingly. Finally, I commit the changes to source control. I think the rest should be cycled automatically, but I don't know how to achieve this. The auto-cycled part is the following:
The source control system should have a post-commit hook that triggers the continuous integration tool to build the project (P). While building, the test code should be run in the proper phase and its reports generated. The user-acceptance tests should be performed on a dedicated server (for example JBoss or Tomcat): bring the server up, run the UA tests, generate the UA test reports, and bring the server down. If all these steps complete successfully, versioning should be performed. In the versioning part, a Maven plugin (or whatever) should take the number of issues resolved from the issue tracking system, increment the related version fragments (minor and revision), and finally append the build number. The version fragments may be saved in the manifest file so they can be shown in the user interface. Last but not least, the CI tool should deploy the artifact to the test environment. That is the whole auto-cycled process I want.
Should the deployment of the artifact to the production environment be done automatically or manually?
Let's start with the side question: automatic deployment to production requires the sign-off of "the business", whoever that is. How good do your tests need to be to push to production automatically? Are they good enough that you trust things to just go live? What's your downtime? Is that acceptable? If your tests miss something, can you roll back? Are you monitoring production so you know if you've introduced problems? Generally, the answers to enough of these questions are negative enough that you can't auto-deploy there as the result of a build/autotest event.
As for the tracking, you'll need a few things. You'll need all your assumptions to be true (which I doubt they are, but it's awesome if you get there). You'll also need a build number that can be incremented after build time based on test results. You'll need source changes to be annotated with bug IDs. You'll need the build system to parse the source changes and associate them with issues. You'll need an API into the build system so you can get the count of issues associated with the build. Finally, you'll need your own bit of scripting to do the query and update the build number accordingly (sketched below).
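As one possible shape for that bit of scripting, here is a rough Java sketch that counts matching issues via JIRA's REST search endpoint. The base URL, JQL, and the crude string-based JSON handling are assumptions to adapt; a real script should add authentication and a proper JSON parser:

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    // Rough sketch: count issues matching a JQL query through JIRA's REST search API.
    // The base URL and JQL are placeholders; authentication is omitted for brevity.
    public final class IssueCounter {

        public static int countIssues(String jiraBaseUrl, String jql) throws Exception {
            String url = jiraBaseUrl + "/rest/api/2/search?maxResults=0&jql="
                    + URLEncoder.encode(jql, StandardCharsets.UTF_8);
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // The JSON body contains a "total" field; a real script should use a JSON library.
            String body = response.body();
            int idx = body.indexOf("\"total\":");
            if (idx < 0) throw new IllegalStateException("Unexpected response: " + body);
            int start = idx + "\"total\":".length();
            while (start < body.length() && !Character.isDigit(body.charAt(start))) start++;
            int end = start;
            while (end < body.length() && Character.isDigit(body.charAt(end))) end++;
            return Integer.parseInt(body.substring(start, end));
        }

        public static void main(String[] args) throws Exception {
            // e.g. resolved bugs that should bump the revision fragment
            int bugs = countIssues("https://jira.example.com",
                    "project = P AND issuetype = Bug AND status = Resolved");
            System.out.println("Resolved bugs: " + bugs);
        }
    }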
That's totally doable, but is it really worth having? What's the value you attach to the numbering scheme?

Should changes to the DB always be part of CI?

This question came up on the development team I'm working with and we couldn't really get to a consensus:
Should changes to the database be part of the CI script?
Assuming the application you are working on has a database involved, I think yes, because that's the definition of integration. If you aren't including a portion of your application, then you aren't really testing your integration. The counter-argument is that the CI server is the place to make sure your basic project setup works: essentially building a virgin checkout of the latest version of your code.
Is there a "best practices" document for CI that would answer this question? Is this something that is debated among those who are passionate about CI?
Martin Fowler's opinion on it:
A common mistake is not to include everything in the automated build. The build should include getting the database schema out of the repository and firing it up in the execution environment.
All code, including the DB schema and pre-populated table values, should be subject to both source control and continuous integration. I have seen far too many projects where source control is used, but not for the DB. Instead there is a master database instance where everyone makes their changes, at the same time. This makes it impossible to do branching and also makes it impossible to recreate an earlier state of the system.
I'm very fond of Visual Studio 2010 Premium's functionality for database schema handling. It makes the database schema part of the project structure, with the master schema under source control. A fresh database can be created right out of the project, and upgrade scripts to bring existing databases to the new schema are generated automatically.
Doing change management properly for databases without VS2010 Premium or a similar tool would at best be painful, if possible at all. If you don't have that tool support, I can understand your colleague who wants to keep the DB out of CI. If you are having trouble arguing for including the DB in CI, then maybe the first step is to get a decent toolset for DB work? Once you have the right tools, it is a natural step to include the DB in CI.
You have no continuous integration if you have no real integration. This means that all components needed to run your software must be part of CI, otherwise you have something just a bit more sophisticated than source control, but no real CI benefits.
Without the database in CI, you can't roll back to a specific version of the application, and you can't run your tests in a real, always-complete environment.
It is of course not an easy subject. In the project I work on, we use alter scripts that need to be checked in together with the source code changes. These scripts are run against our test database to ensure not only the correctness of the current build, but also that upgrading or downgrading by one version is possible and that the update process itself doesn't mess anything up. I believe this is a better solution than dropping and recreating the whole database: it gives you a consistent path to upgrade the database step by step and allows you to use the database in some kind of test environment, with data, users, etc.
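To illustrate the alter-script approach (this is not our actual tooling; the schema_version table and file-naming convention are invented, and in practice a tool such as Flyway or Liquibase does this job), a minimal runner that applies pending scripts in order against the CI test database might look like this:

    import java.nio.file.*;
    import java.sql.*;
    import java.util.*;

    // Minimal sketch of an alter-script runner for the CI test database.
    // Assumes a schema_version table and numbered scripts like 0001_add_customer_email.sql.
    public final class MigrationRunner {

        public static void migrate(Connection db, Path scriptDir) throws Exception {
            try (Statement st = db.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS schema_version (script VARCHAR(255) PRIMARY KEY)");
            }
            Set<String> applied = new HashSet<>();
            try (Statement st = db.createStatement();
                 ResultSet rs = st.executeQuery("SELECT script FROM schema_version")) {
                while (rs.next()) applied.add(rs.getString(1));
            }
            List<Path> scripts = new ArrayList<>();
            try (DirectoryStream<Path> dir = Files.newDirectoryStream(scriptDir, "*.sql")) {
                dir.forEach(scripts::add);
            }
            scripts.sort(Comparator.comparing(p -> p.getFileName().toString())); // numeric prefix gives the order
            for (Path script : scripts) {
                String name = script.getFileName().toString();
                if (applied.contains(name)) continue;              // already run on this database
                try (Statement st = db.createStatement()) {
                    st.execute(Files.readString(script));          // apply the alter script
                }
                try (PreparedStatement ps = db.prepareStatement(
                        "INSERT INTO schema_version (script) VALUES (?)")) {
                    ps.setString(1, name);
                    ps.executeUpdate();
                }
            }
        }
    }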

Handling multiple branches in continuous integration

I've been dealing with the problem of scaling CI at my company and at the same time trying to figure out which approach to take when it comes to CI and multiple branches. There is a similar question on Stack Overflow, Multiple feature branches and continuous integration. I've started a new one because I'd like to get more of a discussion and provide some analysis in the question.
So far I've found two main approaches I can take (or maybe there are others?).
Multiple sets of jobs (talking about Jenkins/Hudson here), one per branch
Write tooling to manage the extra jobs
Create/modify/delete jobs in bulk
Custom settings for each job per branch (SCM URL, duplicated dependency-management repos)
Some examples of people tackling this problem with shell tools, Ant scripts, and the Jenkins CLI:
http://jenkins.361315.n4.nabble.com/Multiple-branches-best-practice-td2306578.html
http://jenkins.361315.n4.nabble.com/Is-it-possible-to-handle-multiple-branches-where-some-jobs-should-run-on-each-one-without-duplicatin-td954729.html
http://jenkins.361315.n4.nabble.com/Parallel-development-with-branches-td1013013.html
Configure or Create hudson job automatically
Will cause more load on your CI cluster
Feedback cycle for devs slows down (if the infrastructure cannot handle the new load)
Multiple sets of jobs for two branches (dev & stable)
Manage the two sets manually (if you change the configuration of a job, be sure to change it for the other branch too)
PITA, but at least there are only a few to manage
Other extra branches won't get a full test suite before they get pushed to dev
Unsatisfied devs. Why should a dev care about CI scaling problems? He has a simple request: when I branch, I would like to test my code. Simple.
So it seems that if I want to provide devs with CI for their own custom branches, I need special tooling for Jenkins (an API, shell scripts, or something?) and to handle scaling. Or I can tell them to merge to DEV more often and live without CI on custom branches. Which would you choose, or are there other options?
When you talk about scaling CI you're really talking about scaling the use of your CI server to handle all your feature branches along with your mainline. Initially this looks like a good approach as the developers in a branch get all the advantages of the automated testing that the CI jobs include. However, you run into problems managing the CI server jobs (like you have discovered) and more importantly, you aren't really doing CI. Yes, you are using a CI server, but you aren't continuously integrating the code from all of your developers.
Performing real CI means that all of your developers are committing regularly to the mainline. Easy to say, but the hard part is doing it without breaking your application. I highly recommend you look at Continuous Delivery, especially the Keeping Your Application Releasable section in Chapter 13: Managing Components and Dependencies. The main points are:
Hide new functionality until it's finished (a.k.a. feature toggles; a minimal sketch follows this list).
Make all changes incrementally as a series of small changes, each of which is releasable.
Use branch by abstraction to make large-scale changes to the codebase.
Use components to decouple parts of your application that change at different rates.
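For the first point, here is a minimal feature-toggle sketch; the properties file and toggle names are illustrative assumptions, not a specific library:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Properties;

    // Minimal feature-toggle sketch: unfinished code ships on the mainline but stays dark
    // until the toggle is flipped in configuration (file name and keys are illustrative).
    public final class FeatureToggles {

        private static final Properties toggles = new Properties();

        static {
            try (FileInputStream in = new FileInputStream("feature-toggles.properties")) {
                toggles.load(in); // e.g. "new.checkout.flow=false"
            } catch (IOException e) {
                // No file found: every toggle defaults to off, so unfinished features stay hidden.
            }
        }

        public static boolean isEnabled(String feature) {
            return Boolean.parseBoolean(toggles.getProperty(feature, "false"));
        }
    }

    // Call site: both code paths live on the mainline and are continuously integrated.
    //
    // if (FeatureToggles.isEnabled("new.checkout.flow")) {
    //     newCheckoutFlow.start(order);
    // } else {
    //     legacyCheckoutFlow.start(order);
    // }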
They are pretty self-explanatory except for branch by abstraction. This is just a fancy term for the following steps (a compressed sketch follows them):
Create an abstraction over the part of the system that you need to change.
Refactor the rest of the system to use the abstraction layer.
Create a new implementation, which is not part of the production code path until complete.
Update your abstraction layer to delegate to your new implementation.
Remove the old implementation.
Remove the abstraction layer if it is no longer appropriate.
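Here is a compressed, hypothetical Java illustration of those steps; the payment gateway is invented purely for the example:

    // Steps 1-2: introduce an abstraction and point the rest of the system at it.
    interface PaymentGateway {
        void charge(String customerId, long amountCents);
    }

    // The existing implementation keeps serving production while work proceeds.
    class LegacyPaymentGateway implements PaymentGateway {
        public void charge(String customerId, long amountCents) {
            // ... old code path, unchanged ...
        }
    }

    // Step 3: the new implementation grows on the mainline, but nothing routes to it yet.
    class NewPaymentGateway implements PaymentGateway {
        public void charge(String customerId, long amountCents) {
            // ... new code path, built incrementally behind the abstraction ...
        }
    }

    // Steps 4-5: a single switch point delegates to the new implementation when it is ready;
    // afterwards LegacyPaymentGateway is deleted, and the interface itself can go too (step 6).
    class PaymentGatewayFactory {
        static PaymentGateway create(boolean newImplementationReady) {
            return newImplementationReady ? new NewPaymentGateway() : new LegacyPaymentGateway();
        }
    }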
The following paragraph from the Branches, Streams, and Continuous Integration section in Chapter 14: Advanced Version Control summarises the impacts.
The incremental approach certainly requires more discipline and care - and indeed more creativity - than creating a branch and diving gung-ho into re-architecting and developing new functionality. But it significantly reduces the risk of your changes breaking the application, and will save you and your team a great deal of time merging, fixing breakages, and getting your application into a deployable state.
It takes quite a mind shift to give up feature branches, and you will always get resistance. In my experience this resistance is based on developers not feeling safe committing code to the mainline, and this is a reasonable concern. It in turn usually stems from a lack of knowledge, confidence, or experience with the techniques listed above, and possibly from a lack of confidence in your automated tests. The former can be solved with training and developer support. The latter is a far more difficult problem to deal with; however, branching doesn't provide any extra real safety, it just defers the problem until the developers feel confident enough with their code.
I would set up separate jobs for each branch. I've done this before, and it isn't hard to manage and set up if you've configured Hudson/Jenkins correctly. A quick way to create multiple jobs is to copy from an existing job that has similar requirements and modify them as needed. I'm not sure if you want to allow each developer to set up their own jobs for their own branches, but it isn't much work for one person (i.e. a build manager) to manage. Once the custom branches have been merged into stable branches, the corresponding jobs can be removed when they are no longer necessary.
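If you do end up scripting the duplication rather than clicking through the UI, a rough sketch against Jenkins' remote API could look like the following. The URL, credentials, and job names are placeholders, and depending on your security settings you may also need to send a CSRF crumb:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Base64;

    // Rough sketch: copy an existing Jenkins job as the starting point for a branch job,
    // using the remote API's createItem?mode=copy call. Adapt auth/CSRF handling to your setup.
    public final class BranchJobCreator {

        public static void copyJob(String jenkinsUrl, String user, String apiToken,
                                   String templateJob, String newJob) throws Exception {
            String url = jenkinsUrl + "/createItem?name=" + newJob
                    + "&mode=copy&from=" + templateJob;
            String auth = Base64.getEncoder()
                    .encodeToString((user + ":" + apiToken).getBytes());
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .header("Authorization", "Basic " + auth)
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("Jenkins answered HTTP " + response.statusCode());
            // After copying, update the new job's SCM branch setting, e.g. via its config.xml.
        }

        public static void main(String[] args) throws Exception {
            copyJob("https://jenkins.example.com", "builder", "api-token",
                    "app-build-dev", "app-build-feature-login");
        }
    }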
If you're worried about the load on the CI server, you could set up separate instances of the CI or even separate slaves to help balance the load across multiple servers. Make sure that the server you are running Hudson/Jenkins on is adequate. I've used Apache Tomcat and just had to ensure that it had enough memory and processing power to process the build queue.
It's important to be clear on what you want to achieve using CI and then figure out a way to implement it without much manual effort or duplication. There's nothing wrong with using other external tools or scripts that are executed by your CI server that help simplify your overall build management process.
I would choose dev + stable branches. And if you still want custom branches and are afraid of the load, then why not move the custom ones to the cloud and let developers manage them themselves, e.g. http://cloudbees.com/dev.cb
This is the company where Kohsuke is now.
There is Eclipse tooling as well, so if you are on Eclipse, you will have it tightly integrated right into the development environment.
Actually, what is really problematic is build isolation with feature branches. In our company we have a set of separate Maven projects that are all part of a larger distribution. These projects are maintained by different teams, but for each distribution all of the projects need to be released. A feature branch may now span more than one project, and that's when build isolation gets painful. There are several solutions we've tried:
create separate snapshot repositories in nexus for each feature branch
share local repositories on dedicated slaves
use the repository-server-plugin with upstream repositories
build all within one job with one private repository
As a matter of fact, the last solution is the most promising; all the others fall short in one way or another. Together with the Job DSL plugin it is easy to set up a new feature branch: simply copy and paste the Groovy script, adapt the branches, and let the seed job create the new jobs. Make sure that the seed job removes unmanaged jobs. Then you can easily scale with feature branches across different Maven projects.
But as Tom said above, it would be nicer to overcome the need for feature branches and teach devs to integrate cleanly; that is a longer process, though, and the outcome is unclear when there are many legacy parts of the system you won't touch any more.
my 2 cents

How to migrate from "Arcane Integration" to Continuous Integration?

Right now a project I'm working on has reached a level of complexity that requires more than a few steps (actually, it's become arcane!) to produce a complete/usable product. Unfortunately we didn't start out with a Continuous Integration mindset, so as you can imagine it's kind of painful at times, and at other times I can easily waste half a day trying to get a clean, tested build.
Anyway, like any HUGE project it consists of many components in many different languages (not only enterprise-style Java or C#, for example), as well as many graphical and textual resources. Now the problem is that when I look into Continuous Integration, I always find best practices and techniques that assume one is starting a new project from the ground up. However, this isn't a new project, so I was wondering what some good resources are for proactively migrating from Arcane Integration towards Continuous Integration :)
Thanks in advance!
Here it is in two simple (hah) steps.
Go for the repeatable build:
Use source control, get all code checked in.
Establish and document all the tools used to build (mainly, which compiler version). Have a repeatable deployment and setup process for these tools.
Establish and document clearly any resources which are necessary to build but are not checked in (third-party installations, service packs, etc.). Have a repeatable deployment and setup process for these dependencies.
Before committing to source control, developers must:
update their working copy
successfully build
run and pass automated tests
These steps can be done one at a time, as a sort of path to follow, and you'll get benefits at each stage. For example, if you aren't using source control at all, just getting the code into source control (without anything else) is a big step forward. Likewise, if there are no automated tests yet, developers can't run them, but they can still pull the latest commits and let the compiler check their work.
If you can do all of these, you'll get to a nice sane place.
The goals are repeatable build processes and developers that are plugged in to how their changes affect the build and other developers.
Then you can reap the bonuses by establishing higher compliance:
Developers establish a habit of committing frequently. Code in a working copy should never be more than a day old.
An automated build process monitors source control for check-ins and delivers the results to a place where users can accept them (such as a test environment, a preview website, or even simply a location where the user can find the .exe).
The same way you eat an elephant: one bite at a time. ;-) Continuous integration requires an automated build, so start with that. Automate the building of each piece; Ant or NAnt is a great way to do this. Make each component's construction a NAnt task. Then your entire system build can aggregate those individual tasks.
From there, you can add tasks for deployment, unit testing, and so on. If you want to use a CI technology, you can wire it up to your NAnt build.
I would start by writing down all the steps it takes to do the build and test manually. After that you at least have a guide for doing it the old way, and writing things down gives you the chance to look at it as a complete process.
Then look for parts to script.
Ideally you want to trigger a build and test from a code commit and only rebuild and retest the changed parts, with perhaps a full build and test nightly or weekly. You'll need log files or database entries and reports on the build success or lack of it.
You'll want to search out and evaluate pre-built products and open-source build-your-own kits. You can certainly write all the scripting and reporting yourself, but it will take a while and you'll probably end up with a just barely good enough reporting system since your job is coding the product, not coding the build system. :-)
I would guess that a piecemeal migration isn't really an option: half-done solutions will only make it worse.
My approach would be to take one creative engineer who understands the build process, sit him down and say "Fix this". Give him a week or two.
The end goal would be a process that runs beginning to end with a single make command.
I also recommend an automated "setup" procedure where you simply do a checkout and run a batch file from a network share to install and build all your tools. The amount of time this saves overall is staggering if you bring in new programmers. Most projects take one to three days to get set up on a new computer, and it's always the "new" programmer, who doesn't know what's going on, doing the installs on his own system...
In short: Incrementally
Choose a framework that will work across the diverse range of projects.
One by one, add components to the framework.
If you are not familiar with the framework, tackle a couple of the easier components first, to reduce risk of screwing up.
If you do understand the framework, tackle some of the more difficult and/or commonly built components first, so your team (and management) will appreciate the benefits early, and support the effort more.
Be sure to have a plan to include all of your components, because that's when the full benefit will be realized.
Bring your team with you; make sure you have consensus that this is going to be valuable, or people won't maintain it as the components change.
