Transformation configuration - etl

What is the most appropriate, best-practice way of configuring my transformations?
In other words, imagine I have a big ETL solution based on Kettle that connects to different data sources. I would like to store these data sources in a centralized location and have each transformation look them up every time it needs to connect somewhere.
In SSIS there is package configuration; what is the equivalent in Pentaho?
PS: I do not want to install any third-party framework.
Thank you

This can be done in various ways.
Parameterising the database connections, and configuring the properties via kettle.properties. You could still access that kettle.properties from a shared area or something.
As above, but configuring the connections by reading credentials from a database. Has to be hand crafted, but can be made to work with some caveats.
If you use the repository, then the database connections are stored centrally anyway. So if you have a dev and a prd repo, don't promote the db connection itself when you promote. This is trickier than it sounds, though.
As for all of that, the new 4.4(?) release should have proper lifecycle management to make dealing with all this stuff a lot easier!
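To illustrate the first option: Kettle lets you put `${VARIABLE}` placeholders in the connection fields and resolves them from kettle.properties at run time. The sketch below mimics that lookup in plain Python so the idea is visible outside Kettle. The variable names (DWH_HOST and so on) and the host are made up for illustration, and this is not Kettle's own code.

```python
import re


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def substitute(template, props):
    """Replace ${NAME} placeholders; unknown names are left untouched."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: props.get(m.group(1), m.group(0)),
        template,
    )


# Hypothetical contents of a kettle.properties file, one such file per environment.
KETTLE_PROPERTIES = """
# dev environment
DWH_HOST = dev-db.example.internal
DWH_PORT = 5432
DWH_NAME = warehouse
"""

props = parse_properties(KETTLE_PROPERTIES)
jdbc_url = substitute("jdbc:postgresql://${DWH_HOST}:${DWH_PORT}/${DWH_NAME}", props)
print(jdbc_url)
```

The transformation itself only ever contains the placeholders, so moving from dev to prd means swapping the properties file rather than editing the transformation.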

Related

How to share a database connection between microservices in Spring Cloud

How can I share a database connection among microservices in a Spring Cloud module? If there are many microservices, how can I use the same DB connection, or should I use one DB connection per microservice?
In my opinion, what you've asked for is impossible, simply because each microservice is a dedicated process that runs inside its own JVM (probably on more than one server). When you create a connection to the database (assuming you use a connection pool), it's always at the level of a single JVM.
I understand that chances are you meant something different, but I had to mention it because it directly answers your question.
Now, you can share the same database between microservices (the same schema, tables, etc) so that each JVM will have a set of connections opened (in accordance with connection pool definitions).
However, this is a really bad practice: you don't want to share databases between microservices. The reason is the cost of change: if you (as the maintainer of microservice A) decide to, say, alter one of the tables, all the other microservices now have to support this, and that is not a trivial thing to do.
So, a better approach is to have a service that has a "sole responsibility" for your data in some domain. Now, all the services could contact this service and ask for the required data through well-established APIs that should never be broken. In this approach, the cost of change is much "cheaper" since only this "data service" should be changed in a way that it doesn't break existing APIs.
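A minimal sketch of that idea, assuming a hypothetical `CustomerDataService` and an in-memory SQLite database standing in for the real one. In practice the method would sit behind an HTTP endpoint; the point here is only that the table layout stays private to the service while the returned shape stays stable.

```python
import sqlite3


class CustomerDataService:
    """Sole owner of the customer table; callers only ever see get_customer()."""

    def __init__(self):
        # In a real service this would be a connection pool local to this process.
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE customer ("
            "id INTEGER PRIMARY KEY, given_name TEXT, family_name TEXT)"
        )
        self._db.execute("INSERT INTO customer VALUES (1, 'Ada', 'Lovelace')")

    def get_customer(self, customer_id):
        """Stable contract: returns {'id': ..., 'name': ...} or None."""
        row = self._db.execute(
            "SELECT id, given_name, family_name FROM customer WHERE id = ?",
            (customer_id,),
        ).fetchone()
        if row is None:
            return None
        # The name is stored in two columns, but callers never see that detail.
        return {"id": row[0], "name": f"{row[1]} {row[2]}"}


service = CustomerDataService()
print(service.get_customer(1))
```

If the name columns are later merged or renamed, only the query inside `get_customer` changes; the services calling it are unaffected.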
Now regarding the database connection thing: you will usually have more than one JVM running the same microservice (like the data microservice), so it's not that you share connections between them, but rather that you share the same way of working with the database (because after all it's the same code).
When dealing with a microservice architecture, it is usually the case that you have a distributed system.
Most microservices that communicate with each other are not on the same machine, instance or container. Communication between them is most commonly done via http, though there are many other ways.
I would suggest designing microservices around a single concern of your application. For example, in your case, you could have a "persistence microservice" that would be responsible for data persistence operations on one or more types of data store. It could deal with relational DBs, NoSQL, file storage, etc. Then, via REST endpoints, you can expose any persistence functionality to the microservices that deal with business logic.
A very easy way to build a REST service like this would be with the help of Spring Data REST project.
To answer your actual question, I'm not aware of any way to share actual connections between processes. Beyond that, having many microservices running on the same instance is not a good practice most of the time.
Microservices are very popular these days and everybody is trying to transition to them. My advice would be to make sure you don't "over-engineer" your project.
Hope I didn't misunderstand your question, but to be fair it is a little vague. If you could provide a longer, more detailed description of your architecture and use case, I can suggest more tools/frameworks you can use to achieve your cloudy goals.
First and most important: your microservice should be responsible for handling all data in a given business domain/bounded context. So the question is: 'Why do you need to share a database connection between microservices, and isn't this a sign you went too far with slicing your system?' A microservice is a tool, and the word 'micro' may be a bit misleading :)
For more reading I would suggest e.g. https://learn.microsoft.com/en-us/dotnet/standard/microservices-architecture/architect-microservice-container-applications/identify-microservice-domain-model-boundaries (don't worry, it's general enough to be applicable to Spring as well).

How to do blue/green deployment for hybris?

I want to deploy hybris builds with zero downtime. Our technical architecture consists of two frontend servers, two backend servers, and two master/slave Solr clusters, but a single DB server (MS SQL 2012). A new build may require patch execution which changes the DB schema.
Would it be possible to achieve this in a single DB landscape?
If two DBs are required (blue and green), then what is the best practice for DB replication in the case of hybris?
Hybris does provide a rolling update feature (when you're running it in a cluster environment).
This is targeted to allow for zero downtime.
You can find more information on the hybris help pages, e.g.
https://help.hybris.com/6.5.0/hcd/8c455268866910149b25f7b53d1af3e1.html
Looking at the first picture there, it seems to fit the architecture you describe pretty well.
(But to be honest I have no experience with it, so I can't tell you whether or how well it works :) )
If you have risky changes, or end up needing to roll back your rolled-out update, you will have to do quite a bit of DB cleanup etc.
From that perspective a blue/green setup might sound better, although with DB replication you would end up with the same problem (as your updated schema would be replicated as well, I assume).
Hybris only adds new columns to the DB; it never changes their type or removes them. So a single DB can be OK. I haven't tested this by using the storefront while the system is being updated, but I think it will be OK.
On the other hand, your code will need empty/null checks for the new attributes.

How to implement a Definitive Media Library (ITIL DML)?

I would like to know some way to implement a DML based on ITIL.
Given a library of heterogeneous software, the only solution that crosses my mind is to use a file system structure (with proper security and access permissions). However, this seems very simple, and if the library gets too big it will be hard to find software by searching.
Is there any specific software for DML?
Many tools that offer CMDB management also offer DML management. Some options for this are ServiceNow and IBM's Change and Configuration Management Database.
If you are only looking for DML functionality, a binary repository manager, such as Sonatype Nexus or Artifactory, provides metadata tagging, version control, and many other useful features. Implementing a binary repository manager and proper procedures for maintaining it serves as an excellent DML solution.
There is nothing wrong with using a file system to store software in the form of completed, tested (software) configuration items (CIs) that have passed the appropriate quality assurance tests, etc. But because of the controlled IT environment in ITIL, it is mandatory to establish a control structure for tracking the appropriate information about every CI or piece of software in the DML. These records have to contain all relevant information such as version, build date, development, release date, etc. Only tested, confirmed, quality-checked and deployable CIs should be held in this system.
Because of the extensive metadata, some years ago we created a DML based on PostgreSQL, because handling and managing the CIs along with the mandatory tracking (time of insertion, time and logging of access, access control, maybe licenses, etc.) and metadata is much easier in an SQL database. Of course, we had to build the structure for the metadata into the SQL database, but that was not too complicated, and for our straightforward DML it was sufficient. Building and managing the metadata with each CI was supposedly easier for our system than installing, configuring, learning and managing a whole third-party DML/CML system. A caveat arose when we had to save whole system images of deployed, tested, already integrated and checked systems, because their size was several hundred GB. (With the newest DB versions this should now also be possible; the question is whether it is useful within an SQL database.) Instead, we stored the disk images on separate disks and tracked the metadata in our tailored DML system (our PostgreSQL database) along with the information on where to access them.
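To give a rough idea of what such a metadata structure can look like, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the table names, columns and helper functions are all made up for illustration; it is not the schema we actually used.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE ci (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    version      TEXT NOT NULL,
    build_date   TEXT,
    release_date TEXT,
    qa_passed    INTEGER NOT NULL DEFAULT 0,
    location     TEXT NOT NULL,  -- where the binary or disk image is stored
    inserted_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (name, version)
);
CREATE TABLE access_log (
    ci_id       INTEGER NOT NULL REFERENCES ci(id),
    accessed_by TEXT NOT NULL,
    accessed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
""")


def register_ci(name, version, location, qa_passed):
    """Insert a CI record; only quality-checked items may enter the library."""
    if not qa_passed:
        raise ValueError("only tested, approved CIs belong in the DML")
    cur = db.execute(
        "INSERT INTO ci (name, version, location, qa_passed) VALUES (?, ?, ?, 1)",
        (name, version, location),
    )
    return cur.lastrowid


def fetch_ci(name, version, user):
    """Return a CI's storage location and log who asked for it."""
    row = db.execute(
        "SELECT id, location FROM ci WHERE name = ? AND version = ?",
        (name, version),
    ).fetchone()
    if row is None:
        return None
    db.execute(
        "INSERT INTO access_log (ci_id, accessed_by) VALUES (?, ?)",
        (row[0], user),
    )
    return row[1]
```

Large artifacts stay on disk; the database only holds the `location` pointer plus the tracking data.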
The advantage was that we could easily duplicate the DML and take it, for example, to a customer, where we only had to set up or run our (Postgres-based) DML and were then able to access all the relevant CIs we needed to set up a heterogeneous network of the target system.
In other cases it may be easier to rely on an existing third-party solution, but the idea of a DML can be fulfilled with any storage system as long as the appropriate procedures, metadata, information and access points to the overall life-cycle management are provided and maintained.
regards

Appfabric Caching: Configuration Provider as single point of failure

After doing some initial research into using Appfabric for caching, my understanding is that the configuration provider for the cluster is a single point of failure as mentioned here:
MSDN
I want to use appfabric just for distributed caching, particularly for the tagging features. What are the options to avoid having the configuration provider as this failure point? I thought of two but not sure if one is better or if there are any other options.
(1) Create my own caching service configuration provider. I'm guessing this is possible (?) but I'm not sure how to go about it. I'd probably make a provider that fetched the xml file from S3 since I'm already using AWS.
(2) Configure each cache as a single node cluster and then create a proxy client that uses the individual nodes as a distributed cache, a la a memcached type client.
Thoughts or recommendations, or anything else I should consider in making this decision?
Yes, it is a single point of failure.
Microsoft's recommended solutions seem to be:
(SQL Server provider) Use SQL Server clustering. In my limited experience of it, using SQL Server clustering for this is probably a case of 'the cure is worse than the disease', i.e. it brings a lot of pain. Unless you've already got a SQL Server cluster available, avoid!
(XML provider) Use Windows Server clustering. I have even less knowledge of this than of SQL clustering, so I can't say how well (or otherwise) this might work. It doesn't strike me as a trivial thing to do, though.
You can create your own configuration provider by implementing the ICustomProvider interface and making some registry entries. Using AWS seems like a really good idea to make the config provider resilient, I'd be interested to see how you got on with this.
Creating a proxy client seems to me like you'd be making a lot of work for yourself. At that point it feels like you'd be fighting against AppFabric rather than working with it.
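For a sense of what the proxy approach involves, here is a bare-bones sketch of memcached-style client-side sharding. Plain dicts stand in for the single-node caches, and `ShardedCacheClient` is a made-up name; nothing here is AppFabric API.

```python
import hashlib


class ShardedCacheClient:
    """Routes each key to one of several independent cache nodes by hashing."""

    def __init__(self, nodes):
        # nodes: mapping of node name -> dict standing in for a single-node cache
        self._names = sorted(nodes)
        self._nodes = nodes

    def _node_for(self, key):
        # A stable hash, so every client process picks the same node for a key.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._names)
        return self._nodes[self._names[index]]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key):
        return self._node_for(key).get(key)


nodes = {"cache-a": {}, "cache-b": {}, "cache-c": {}}
client = ShardedCacheClient(nodes)
client.put("user:42", {"name": "Ada"})
print(client.get("user:42"))
```

Even this much leaves the hard parts open: simple modulo hashing remaps most keys whenever a node is added or removed, and, as far as I can see, a tag lookup would have to query every node, since tagged items end up wherever their keys hash to.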
We have also tried AppFabric, but it gave us a fair few headaches. For one, there's no API access, which makes it very difficult to use our current unit testing strategy. We have now moved to NCache, which we find a better option than AppFabric. NCache provides a tagging feature and does not have a single point of failure.

What parts of application you prefer to be externalized as configuration and why?

What parts of your application are not coded?
I think one of the most obvious examples would be DB credentials: it's considered bad to have them hard-coded. And in most situations it is easy to decide whether you want something to be externalized or coded. For me the rules are simple. Some part of the application should be externalized if:
it can and should be changed by non-developer, but not so often to be included in application settings defined in UI (DB credentials, service URLs, etc)
it does not require programming language and seems unnatural being coded (localization)
Do you have anything to add?
This is a little related to this question about spring cfg.
Spring configuration seems less obvious example for me, because in my practice it is never modified by anyone except the developer. And the road of externalizing can take you far away, to the entire project being "configured", not coded - so where to stop?
So please post here some examples from your experience, when you got benefit from having something configured, not coded - like dependency injection configuration in spring, etc.
And if you use spring - how often is configuration changed without recompiling?
Anything that needs to differ between different deployments of your application. That is, anything specific to the environment.
Examples include:
Database connection strings
URLs for web or WCF services
Logging configuration
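As a small sketch of what "specific to the environment" means in practice: defaults live in the code, an optional file supplies per-deployment values, and environment variables override both. The setting names, the `APP_` prefix and the JSON format are just assumptions for the example.

```python
import json
import os

# Defaults that are safe for a developer machine.
DEFAULTS = {
    "db_url": "sqlite:///local.db",
    "service_url": "http://localhost:8080",
    "log_level": "INFO",
}


def load_settings(path=None, environ=None):
    """Merge defaults, an optional JSON file, and APP_* environment variables."""
    environ = os.environ if environ is None else environ
    settings = dict(DEFAULTS)
    if path and os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            settings.update(json.load(fh))
    for key in settings:
        env_name = "APP_" + key.upper()
        if env_name in environ:
            settings[key] = environ[env_name]
    return settings


# The same build, deployed with a different environment:
print(load_settings(environ={"APP_LOG_LEVEL": "DEBUG"}))
```

Nothing is recompiled between deployments; only the file or the environment changes.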
Any information your application uses that is "data" and that could change depending on where it is installed. Things like:
smtp mail server used to send e-mails
Database connect strings
Paths to file locations / folders used by the app
FTP servers & connect info
Active Directory servers used for authentication
Any links displayed in the application to external information sources
Warning limit values
I've even put the RegEx filters used to limit the allowable characters for data entry fields.
Besides the obvious changing stuff (paths, servers, ports, and so on), some people argue that you should be able to easily change whatever might reasonably change, for instance, say you have a generic engine which operates on the business logic (a rule engine).
You would then define the rules in a "configuration file", which ends up being nothing less than programming in a DSL instead of in the general-purpose language. The benefits are that it's closer to the domain, so it's easier and more maintainable, and that you can easily change things that would otherwise demand a new build.
The main argument behind this is that things you assumed would never change always end up changing nonetheless, so you better be prepared.
paths and server names/addresses come to mind..
I agree with your two conditions, which is why I:
Rarely include a config file as part of a Windows or Windows Mobile application (web apps yes).
If I did include a config file meant to be tweaked by end users, it certainly wouldn't be XML.
Employee emails/names since employees can come and go... (you should typically try to keep them out of an application though)
Configuration files should include:
deployment details
DB credentials
file paths
host names
anything that is used in many places but that may change
contact email addresses
options that aren't in the GUI
The last one is a bit open-ended, but very important. I've found it very useful to foresee variables that the client may, in the future, want to change. If changes are infrequent, I or they can edit the config file. If it becomes a frequent thing, it's trivial to add the option to the GUI, which isn't hardcoded.
I would also add encryption keys (which themselves should be encrypted)...
Basically the rule of thumb is: information the application needs BEFORE its regular, functional operation, data that it MUST have on hand (i.e. local and not networked).
Note that this data should not change dynamically or be large in volume; otherwise it should be in the database.
With Spring apps I actually distinguish between two types of configuration:
Items externalized into property files, which are "deploy time" or "environment-specific" concerns: server IPs / addresses, file system locations, etc.
Spring XML configuration which can do lots of things, like indicate the overall application structure, apply behavior via AOP, etc.
I use Spring to wire all the beans in a J2SE application that has no GUI (a transactional switch). That way it's very easy for me to have different configurations in each deployment (we have this thing running in different countries), without having to code anything different.
Another thing I like to have is to manage all the SQL statements separately from the code, when I use plain JDBC (or Spring JDBC). Like in a properties file or XML or something, sometimes even as String properties in the beans that will use the statement (when there is only one bean that will use the statement, such as a DAO).
I am going to use Spring JDBC or vanilla JDBC for data persistence. Here we have decided to externalize all the SQL from the Java code, so that it is more manageable in terms of SQL query tuning and optimization and we don't need to disturb the Java code.
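The idea of keeping SQL outside the code can be sketched in a few lines. This example uses Python and sqlite3 rather than Java and JDBC just to stay short, and the `-- name:` file format, the query names and the `account` table are made up for illustration.

```python
import sqlite3

# Stands in for a separate queries file shipped alongside the application.
QUERIES_FILE = """
-- name: find_by_id
SELECT id, name
FROM account
WHERE id = ?;

-- name: insert_account
INSERT INTO account (id, name) VALUES (?, ?);
"""


def load_queries(text):
    """Parse blocks headed by '-- name: x' into a {name: sql} dict."""
    queries, name, lines = {}, None, []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("-- name:"):
            if name:
                queries[name] = " ".join(lines).rstrip(";")
            name, lines = line.split(":", 1)[1].strip(), []
        elif line:
            lines.append(line)
    if name:
        queries[name] = " ".join(lines).rstrip(";")
    return queries


QUERIES = load_queries(QUERIES_FILE)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, name TEXT)")
db.execute(QUERIES["insert_account"], (1, "alice"))
row = db.execute(QUERIES["find_by_id"], (1,)).fetchone()
print(row)
```

The calling code only refers to queries by name, so a statement can be tuned in the file without touching the code that runs it.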

Resources