ETL tool migration: Best Practices in Parallel Run - parallel-processing

I am new to ETL migration. I have worked with Talend, but have not yet faced the task of migrating a large ETL project from one tool to another (IBM Data Manager to Informatica PowerCenter or Informatica Developer).
I am looking for general guidelines on migrating jobs from one tool to another, and of course for my specific case.
To be more clear:
The database sources and targets will stay the same; what I have to migrate is the ETL part itself.
The approach will be the parallel run as suggested at this blog :
Parallel Run
In my case I do not have to migrate the whole DWH, only the ETL, as the old software will become legacy and the new one is from another vendor (luckily both of them can export XML).
I am looking for a practical approach to the parallel run. I have been advised to copy the source and target tables into the original database schema, but that does not look like the best way to go (and it is not practical when a schema has many tables).
The DWH I am working on has several database instances in Oracle and some in SQL Server, a Test server and a Production one, and for each of them a Staging, Storage and Data Mart area.
Based on this related question and its answer, I am thinking of copying each schema on the go for each project.
Staging in ETL: Best Practices
I am looking for guideline references, but my specific case is the migration from IBM Data Manager to Informatica PowerCenter.

The approach depends on various criteria and personal preferences. Either way you will need to duplicate part or all of the source and destination systems. At one extreme you can use two instances of the entire system. If you have complex upstream processes that are part of the test, or you have massive numbers of tables and processes, and you have the bandwidth and resources to duplicate your system, then this approach may be optimal.
At the other extreme, if the complex processing happens within the ETL tool itself, or you are simply loading tables and need to check they are loaded correctly, then making copies of the tables and pointing your new or old tool at the table copies may be the way to go. This method is very simple and easy to set up.
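As a minimal sketch of the table-copy variant (schema and table names here are invented for illustration, and the syntax is Oracle's; SQL Server would use EXCEPT instead of MINUS), you could clone each target table into a dedicated comparison schema, point the new Informatica mappings at the copies, and then diff the two loads with set operators:

    -- Hypothetical names: DM is the existing data mart schema, DM_PARRUN holds the copies.
    -- 1. Clone the structure of a target table for the new tool to load into.
    CREATE TABLE dm_parrun.fact_sales AS
    SELECT * FROM dm.fact_sales WHERE 1 = 0;   -- structure only; drop the WHERE to copy the data too

    -- 2. After both the legacy and the new ETL have run, diff the results.
    --    Rows produced only by the legacy job:
    SELECT * FROM dm.fact_sales
    MINUS
    SELECT * FROM dm_parrun.fact_sales;

    --    Rows produced only by the new job:
    SELECT * FROM dm_parrun.fact_sales
    MINUS
    SELECT * FROM dm.fact_sales;

When both MINUS queries return zero rows for every target table over a few load cycles, that is a reasonable acceptance criterion for retiring the legacy jobs.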
Keep in mind this forum is not meant to replace blogs and in-depth tech articles on those techniques.

Related

Using Saiku-ui without saiku-server/mondrian?

Is it possible to use the saiku-ui component with a different jolap provider than mondrian, or with a different server backend than the saiku-server component?
I have been looking but I have not found an architecture description of how these pieces fit together and what interfaces they use to communicate. Can anyone point me towards an understanding of what the saiku-ui wants to speak with and what the saiku-server is providing?
The reason for my interest is that I have a set of data spread across hundreds of csv files that I would like to query with a pivot and charting tool. It looks like the standard way to use this with saiku would be to have an ETL process to load them into an RDBMS. However, this would not be a simple process, because the files, their content and the way the files relate to each other vary, so the ETL would have to do a lot of inspection of the data sources to figure it out.
Given this it seems to me that I would have three options in how to use saiku:
1) write a complex ETL to load into an RDBMS, and then use a standard jdbc driver to provide the data to mondrian. A side function of the ETL would be to analyze the inputs and write the mondrian schema file describing the cubes.
2) write a jdbc driver to access the data natively. This driver would parse sql and provide access to the underlying tables. Essentially this would be a custom r/o dbms written on top of the csv files. The jdbc connection would be used by mondrian to access the data. A side function of this custom dbms would be to produce the mondrian schema file.
3) write a tool that provides a jolap interface to the native data (accepts discovery and mdx queries). This would bypass mondrian entirely and interface with the ui.
I may be a bit naive here but I consider each of the three options to be feasible. Option #1 is my least preferred because of the likelihood of the data in the rdbms becoming out of sync with the csv files. Option #3 is most preferred because the data are simple, so not much aggregating is required, and I suspect that mdx will be easier to parse than sql.
So, if i could produce my own jolap data source would it be possible to hook the saiku-ui tools up to it? Where would I look to find out the interface configuration details?
Many years ago, #ronaldbouman created xmondrian - a set of tools with the OLAP server and web UI tools for XMLA browsing and visualisation. But that project has not been updated and has no source code available.
I just updated the OLAP server and libraries to the latest versions.
You can get it here and build it:
https://github.com/Muritiku/xmondrian-build
You can use the web package as an example. The mondrian server works with the saiku-ui.
IMHO,
I would not be as confident as you are, because it took Julian Hyde more than a decade to build Mondrian (MDX->SQL) and Calcite (SQL), which fulfill your last two proposals.
You might simply consider using Calcite, or even better Dremio. Dremio has a JDBC interface, and can query directories of CSV files in SQL. I tested Saiku over Dremio successfully (with a schema based on two separate RDBMS). Just be careful to set up the tables' schemas accordingly in the Mondrian v4 schema.
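If you go the Dremio route, the idea (sketched below with made-up source and folder names, and assuming the folder of CSV files has been promoted to a dataset in Dremio) is that the CSVs become queryable over JDBC like any other table, so Mondrian can generate SQL against them:

    -- Hypothetical Dremio source "csvdata" containing a promoted folder "sales"
    SELECT region,
           SUM(CAST(amount AS DOUBLE)) AS total_amount
    FROM   "csvdata"."sales"
    GROUP  BY region;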
Best regards,
Fabrice Etanchaud
Dremio

How to design Greenplum database structure

I am working on designing the structure of a Greenplum database.
We have many clients whose data we need to store.
There are two ways to design the database structure: we build one database and a different schema in this database for each client,
or we build a different database for each client. Which way is better?
What is more, we need to migrate databases or schemas from the dev environment to another environment.
Thanks
William
William,
Either way will work. If you are keeping multiple tenants in Greenplum and there is no data sharing, you might be better off keeping them in separate databases - easier for security and for backups. If there is a requirement that they share some common data, then using multiple schemas in one database is the better option.
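To make the two options concrete (all names below are illustrative, and the per-client roles are assumed to exist already), the choice boils down to something like this on the Greenplum side:

    -- Option A: one database per client - hard isolation, separate backups
    CREATE DATABASE client_a;
    CREATE DATABASE client_b;

    -- Option B: one database, one schema per client - shared reference data is possible
    CREATE SCHEMA client_a;
    CREATE SCHEMA client_b;
    CREATE SCHEMA shared;   -- common lookup/reference tables

    GRANT USAGE ON SCHEMA client_a TO role_client_a;
    GRANT USAGE ON SCHEMA shared TO role_client_a, role_client_b;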
I am not sure what version of Greenplum you are on, but you should be able to back up a schema from dev and restore it with gprestore using the --redirect option to put it in the database you want it to be in.
Jim McCann
Pivotal

Developer sandboxes for Oracle database

We are developing a large data migration from Oracle DB (12c) to another system with SSIS. The developers are using a copy of the production database, but due to the complexity of the data transformation we have to do things in stages by preprocessing data into intermediate helper tables, which are then used further downstream. The problem is that all developers are using the same database and screw each other up by running things simultaneously. Does Oracle DB offer anything in terms of developer sandboxing? We could build a mechanism to handle this (e.g. have a dev ID in the helper tables, then query views that map to the dev), but I'd much rather use built-in functionality. Could I use Oracle Multitenant for this?
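(For reference, the roll-your-own mechanism mentioned above could look roughly like the sketch below; all names are invented, and it keys the helper rows on the Oracle session user.)

    -- Shared helper table carries a DEV_ID column
    CREATE TABLE helper_stage (
      dev_id   VARCHAR2(30),
      cust_id  NUMBER,
      payload  VARCHAR2(4000)
    );

    -- Each developer works through a view that filters on his/her own session user
    CREATE OR REPLACE VIEW my_helper_stage AS
    SELECT cust_id, payload
    FROM   helper_stage
    WHERE  dev_id = SYS_CONTEXT('USERENV', 'SESSION_USER');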
We ended up producing a master subset database of select schemas/tables through some fairly elaborate PL/SQL, then made several copies of this master schema so each dev has his/her own sandbox (as Alex suggested). We could have used Oracle Data Masking and Subsetting, but it's too expensive. Another option for creating the subset database would have been to use Jailer. I should note that we didn't have a need to mask any sensitive data.
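A stripped-down sketch of the per-developer copy step (schema and table names are invented, and grants, indexes and constraints are not handled here) might look like:

    -- Copy every table of the master subset schema into a developer schema.
    -- Run as a privileged user; assumes the DEV1 user/schema already exists.
    BEGIN
      FOR t IN (SELECT table_name
                FROM   all_tables
                WHERE  owner = 'MASTER_SUBSET') LOOP
        EXECUTE IMMEDIATE
          'CREATE TABLE dev1.' || t.table_name ||
          ' AS SELECT * FROM master_subset.' || t.table_name;
      END LOOP;
    END;
    /

Data Pump export/import with REMAP_SCHEMA would be another common way to stamp out the copies.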
Note: I would think this is a fairly common problem, so if new tools and solutions arise, please post them here as answers.

Storing DDL scripts into a clob

Using Toad for Oracle 10.6 with the DB Admin Add-in. For our migration process from Dev to QA to Prod, we are starting to use the Schema Compare module to generate the sync script DDL. After execution, we want to store the sync scripts for historical purposes. Due to policy, we are unable to copy these anywhere else; even the Windows server where Toad runs is write-restricted.
I am thinking I could create a table with a CLOB column and store the scripts there, unless you folks tell me that this is a Really Bad Idea. I am looking for any tips on things like handling embedded special characters, or any other pitfalls that I may encounter.
Thanks,
JimR
Doing the best with what you have is a good thing. If management won't let you have version control, then jamming it in the database is better than nothing at all. The important thing is to create a process, document it, and follow it for deployments.
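A bare-bones sketch of that idea (table, sequence and column names are only placeholders) is a history table with a CLOB column, loaded through a PL/SQL bind variable so quotes and other special characters in the script never need escaping:

    CREATE SEQUENCE sync_script_seq;

    CREATE TABLE sync_script_history (
      script_id    NUMBER PRIMARY KEY,
      applied_on   DATE DEFAULT SYSDATE,
      environment  VARCHAR2(10),
      script_text  CLOB
    );

    -- Inserting through a bind variable avoids any quote-escaping issues.
    -- PL/SQL literals are capped at 32767 bytes; longer scripts can be appended
    -- in chunks with DBMS_LOB.WRITEAPPEND or loaded with DBMS_LOB.LOADCLOBFROMFILE.
    DECLARE
      l_script CLOB := q'[ ... paste or load the sync script here ... ]';
    BEGIN
      INSERT INTO sync_script_history (script_id, environment, script_text)
      VALUES (sync_script_seq.NEXTVAL, 'QA', l_script);
      COMMIT;
    END;
    /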
As for using schema differences to deploy, take a look at:
http://thedailywtf.com/Articles/Database-Changes-Done-Right.aspx

ArcSDE Database Refactoring Tool options

We are using liquibase as an evolutionary DB change management tool in our applications, and it works great when we use it on "common" database schemas.
But we also work with GIS applications using the Esri ArcSDE 9.3 platform over Oracle, and in this case all (or almost all) tables in the schema (both GIS and 'alphanumeric' tables) are managed (create table, grants, etc.) through ArcSDE. So when we want to create a new feature class we now use ArcCatalog, and this way it is not possible to manage the feature classes' changes directly through SQL using liquibase or another automated refactoring tool.
So if we cannot use liquibase to manage the changes, we at least want to execute the management operations on our features from the command line.
We have started looking for tools that avoid the use of ArcCatalog so that we can automate the changes using scripts. We are investigating these possibilities:
Try to capture the SQL that ArcCatalog/ArcSDE executes each time we make a change to a feature class by monitoring the Oracle connection. This produces a set of SQL instructions that is far too complex (it involves indexes, versioning tables, etc.), so we gave up on this approach.
Use the sdelayer and sdetable admin commands installed on the ArcSDE server.
Use the data management tool: a Python-based library to manage the feature classes, but it has to be executed from a machine with the desktop version installed.
These last two options would provide a way to manage features from the command line, but our goal is to find or develop a tool that manages changes the way liquibase does. With these tools we would have to map each SQL DDL operation to an ArcSDE command, and currently no DB refactoring tool provides this (so far we have checked liquibase, dbdeploy and flyway).
Has anybody solved this problem of evolutionary change management with ArcSDE? Any insight into another way to face this problem?
I'll take a stab at this, although I'm unfamiliar with one of the products you mention (liquibase specifically). I have used Oracle and I'm very familiar with ArcGIS (ArcMap & ArcCatalog).
Here's just some additional information that may help and my interpretation of your question.
My interpretation - "What's a simple way to manage or enable us to automate the management of our tables of GIS data in our Oracle database without having to use ArcCatalog all the time?"
So - I'll throw this concept back in the ring - I know SQL Server has spatial data types ("geometry", etc.) and that you can let ArcGIS connect directly and interpret this data without even installing SDE. I also know Oracle has compatible spatial types. So I would consider migrating my data out of the managed feature classes that ArcCatalog creates and pushing it into Oracle-native geometry-based tables. This way you can treat them like regular tables, cut ESRI out of the solution, and manage them with liquibase, etc. Hopefully that helps.
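If you do go down the native-geometry road, the Oracle side of such a table is plain DDL that liquibase can manage; very roughly (table name, SRID, extents and tolerance below are invented for the example):

    -- A feature table using Oracle's native SDO_GEOMETRY type
    CREATE TABLE roads (
      road_id  NUMBER PRIMARY KEY,
      name     VARCHAR2(100),
      geom     SDO_GEOMETRY
    );

    -- Register the spatial column (extents, tolerance and SRID are placeholders)
    INSERT INTO user_sdo_geom_metadata (table_name, column_name, diminfo, srid)
    VALUES (
      'ROADS', 'GEOM',
      SDO_DIM_ARRAY(
        SDO_DIM_ELEMENT('X', -180, 180, 0.005),
        SDO_DIM_ELEMENT('Y',  -90,  90, 0.005)
      ),
      4326
    );

    -- Spatial index so spatial queries (and direct ArcGIS connections) perform well
    CREATE INDEX roads_geom_idx ON roads (geom)
      INDEXTYPE IS MDSYS.SPATIAL_INDEX;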
I would also consider upgrading to 10.1 or at least 10.0 (I promise I'm not an undercover salesman), although that will require your users to come with you on the client side (http://resources.arcgis.com/en/help/main/10.1/index.html#//002q000000n8000000) because the newer python APIs are much easier and faster to use (arcpy vs. the GP model), if you do choose to use Python to manage your stuff. (Regardless, either API isn't very well developed and isn't intuitive to code in or fast.)
Good luck.
