What are the required functionalities of ETL frameworks? - etl

I am writing an ETL (in Python with a MongoDB backend) and was wondering: what kind of standard functions and tools should an ETL have to be called an ETL?
This ETL will be as general purpose as possible, with a scriptable and modular approach. Mostly it will be used to keep different databases in sync, and to import/export datasets in different formats (XML and CSV). I don't need any multidimensional tools, but it is possible that they'll be needed later.

Let's think of the ETL use cases for a moment.
Read databases through a generic DB-API adapter.
Read flat files through a similar adapter.
Read spreadsheets through a similar adapter.
Arbitrary rules
Filter and reject
Add columns of data
Profile Data.
Statistical frequency tables.
Transform (see cleanse, they're two use cases with the same implementation)
Do dimensional conformance lookups.
Replace values, or add values.
At any point in the pipeline
Or prepare a flat-file and run the DB product's loader.
Further, there are some additional requirements that aren't single use cases.
Each individual operation has to be a separate process that can be connected in a Unix pipeline, with individual records flowing from process to process. This uses all the CPU resources.
You need some kind of time-based scheduler for places that have trouble reasoning out their ETL preconditions.
You need an event-based schedule for places that can figure out the preconditions for ETL processing steps.
Note. Since ETL is I/O bound, multiple threads do you little good. Since each process runs for a long time -- especially if you have thousands of rows of data to process -- the overhead of "heavyweight" processes doesn't hurt.
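As a rough illustration of the "one operation per process" idea, here is a minimal sketch of a filter-and-reject stage that reads CSV records from one stream and writes the accepted ones to another. The function names, the `customer_id` column and the rejection rule are assumptions made for the example, not part of any particular framework.

```python
import csv
import io


def filter_stage(source, sink, keep):
    """Copy CSV records from source to sink, dropping rows that fail keep()."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(sink, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    rejected = 0
    for row in reader:  # records flow through one at a time
        if keep(row):
            writer.writerow(row)
        else:
            rejected += 1
    return rejected


def has_customer_id(row):
    # Hypothetical rule: reject rows whose customer_id column is empty.
    return bool((row.get("customer_id") or "").strip())


# In-memory streams stand in for the pipe here. As a stage in a shell
# pipeline, the same function would be called with sys.stdin and sys.stdout.
demo_in = io.StringIO("customer_id,name\n1,Ann\n,Bob\n2,Cy\n")
demo_out = io.StringIO()
rejected = filter_stage(demo_in, demo_out, has_customer_id)
```

Because each stage only sees an input stream and an output stream, stages written this way can be chained with ordinary pipes and the operating system spreads them across cores.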

Here's a random list, in no particular order:
Connect to a wide range of sources, including all the major relational databases.
Handle non-relational data sources like text files, Excel, XML, etc.
Allow multiple sources to be mapped into a single target.
Provide a tool to help map from source to target fields.
Offer a framework for injecting transformations at will.
Programmable API for writing complex transformations.
Optimize load process for speed.

Automatic / heuristic mapping of column names. E.g simple string mappings:
DB1: customerId
DB2: customer_id
I find that a lot of the work I have done in DTS / SSIS could have been generated automatically.
This is not necessarily "required functionality", but it would keep a lot of your users very happy indeed.
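A minimal sketch of such a heuristic, assuming that ignoring case and separators is enough to match names like `customerId` and `customer_id`. The function names are made up for the example, and real schemas would probably also need synonyms or fuzzy matching.

```python
import re


def normalize(name):
    """Reduce a column name to lowercase letters and digits only."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def suggest_mapping(source_columns, target_columns):
    """Pair source and target columns whose normalized names are equal."""
    targets = {normalize(t): t for t in target_columns}
    mapping, unmatched = {}, []
    for col in source_columns:
        match = targets.get(normalize(col))
        if match is None:
            unmatched.append(col)  # left for the user to map by hand
        else:
            mapping[col] = match
    return mapping, unmatched


mapping, unmatched = suggest_mapping(
    ["customerId", "OrderDate", "notes"],
    ["customer_id", "order_date", "total"],
)
```

Returning the unmatched columns separately lets a mapping tool pre-fill the obvious pairs and ask the user only about the rest.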


Data Flow process in ETL architecture

I need some clarity on how data will flow from source system to target system in a typical ETL data warehouse architecture.
For example, suppose the source system, target system and ETL server are in three different networks, and the ETL applies some transformations and logic. In this case, does the data flow source -> ETL server -> target server, or source -> target with the transformations applied on the fly in between, so that the data never passes through the ETL server?
In most situations (I can't think of an exception, but there must be some), the data moves from the source system to the ETL server and then to the target server. Transformations take place on the ETL server, which can often cause a bottleneck if that machine is under-powered or light on memory. If that turns out to be the case, an ELT approach may become necessary. Most ETL tools can easily accommodate that approach, though.
Anything more specific will depend on the specific ETL product you're using and your server architecture.
As you said, there are two different methods of ETL, Pipeline and multistage.
1- In the pipeline method there is no ETL server or staging area. Transformation (including data cleansing, validation, format revision, etc.) is applied at the same time as the extract step, and the transformed data is then loaded onto the target server. In other words, you run the transformation program on the source or target server.
2- In the multistage method you have at least 3 servers (or distinct spaces): source, staging and target. For a database, for example, these can be 3 separate database servers or 3 schemas in one database. Either way, the transformation program should run in the staging area. This is the space where you write the extracted data and then apply transformations to it. The staging area may contain many stages. For example, you can write the extracted data to stg1 tables or files. After that, transformation_step_1 is applied to the stg1 data and the transformed data is written to stg2 tables or files.
Depending on the application, you may need to apply transformation_step_2 to the stg2 data and write the result to stg3 tables or files. This process continues until all transformations have been applied, which is why it is called multistage ETL.
I prefer the multistage method because a program written this way is easier to debug and does not need to hold the full data set in RAM. One disadvantage of this method is that it uses a lot of storage.
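The multistage flow described above can be sketched with an in-memory SQLite database standing in for the staging area. The table layout (`customer`, `amount`) and the two transformation steps are assumptions chosen only to keep the example small.

```python
import sqlite3


def run_multistage(rows):
    """Load raw rows into stg1, clean them into stg2, aggregate into target."""
    db = sqlite3.connect(":memory:")  # stands in for the staging database
    db.execute("CREATE TABLE stg1 (customer TEXT, amount TEXT)")
    db.execute("CREATE TABLE stg2 (customer TEXT, amount REAL)")
    db.execute("CREATE TABLE target (customer TEXT PRIMARY KEY, total REAL)")

    # Extract: store the data unchanged so the raw input can be inspected later.
    db.executemany("INSERT INTO stg1 VALUES (?, ?)", rows)

    # transformation_step_1: normalize names, drop rows without an amount.
    db.execute(
        "INSERT INTO stg2 "
        "SELECT upper(trim(customer)), CAST(amount AS REAL) "
        "FROM stg1 WHERE amount IS NOT NULL AND trim(amount) <> ''"
    )

    # transformation_step_2 and load: aggregate per customer into the target.
    db.execute(
        "INSERT INTO target "
        "SELECT customer, sum(amount) FROM stg2 GROUP BY customer"
    )
    db.commit()
    return db


db = run_multistage([(" ann ", "10.5"), ("ANN", "4.5"), ("bob", ""), ("bob", "3")])
totals = db.execute("SELECT customer, total FROM target ORDER BY customer").fetchall()
```

Every intermediate result stays queryable in its own table, which is what makes this style easy to debug, at the cost of the extra storage mentioned above.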

How can we do data analysis for DB replication project

We are facing a data verification issue in our project.
The project is about replicating data from Sybase to Oracle DBs.
The structure of Table A is the same in Sybase and Oracle,
with the same columns and primary key combination across all the databases.
E.g., if Sybase has Table A with columns a, b and c,
a table with the same name and the same columns will be available in the other databases.
We are done with the replication part, but we have seen some silent failures such as data discrepancies, and we are wondering whether there is a tool already available for detecting these.
Any information on this would be helpful. Thanks.
Sybase (now SAP) has a couple products that can be used for data comparisons and reconciliation:
rs_subcmp - an older, 32-bit tool that comes with the Sybase Replication Server product and can be used to compare data between source and target; SQL reconciliation scripts can be generated from the differences and then applied to the target to bring it in sync with the source; if your tables are more than 1GB in size you can still use rs_subcmp but you'll need to create multiple comparison jobs (via where clauses) to work on different subsets of your tables [I don't recall if rs_subcmp can be used for heterogeneous replication setups, e.g., ASE-Oracle.]
Data Assurance (DA) - the newer, 64-bit product ... also from Sybase ... which can also compare data and (re)sync the target(s) from the source (either via SQL reconciliation scripts or directly); DA is capable of handling comparisons between a handful of different RDBMS products (e.g., ASE-Oracle); I'm currently working on a project where one of the requirements is to validate (and reconcile where needed) 200+TB of data being migrated from Oracle to HANA, and I'm using DA for the validation/reconciliation portion of the project
As #TenG has hinted at with his answer, there's a good bit of effort involved to compare data and generate code to reconcile the differences. Rolling your own code is doable but will entail a lot of work. If you've got the money you'll likely find 3rd party tools can get most/all of the work done for you.
If you used a 3rd party product to replicate your data from Sybase to Oracle, you may want to see if the same vendor has a comparison/validation/reconciliation tool you could use.
I've worked on a few migration projects and a key part has always been data reconciliation.
I can only talk about the approaches we took, based on constraints around tools available and minimising downtime, and constraints of available space.
In all cases I took to writing scripts that worked on two levels - summary view and "deep dive". We couldn't find any tools readily available that did what we wanted in a timely enough manner. In fact, even the migration tools we found (datapump, sqlloader, golden gate, etc.) had limitations, so we hand-coded scripts to handle the bits that we found to be lacking or too slow in the standard tools.
The summary view varied from project to project. It was part functional based (do the accounting figures for transactions match) for the users to verify, and part technical. For smaller tables we could just write simple reports and the diff was straight forward.
For larger tables we wrote technical reports that looked at bands of data (e.g., grouping the PK into 1000s), collected all the column data for each band and produced a checksum, generating a report for each table like:
PK ID Range Start  Checksum
-----------------  -----------
100000             22773377829
200000             38938938282
Corresponding table pairs from each database were then "diff"d against each other to highlight discrepancies. Any differences that were found could then be looked at in more detail.
The scripts were written in such a way as to allow them to run in parallel, each looking at discrete bands. The band ranges were tunable as well to get the best throughput. This obviously sped things up.
The scripts were shell scripts firing off sqlplus reports, and similar for the source database.
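The banded-checksum idea can be sketched in a few lines. This is not the original shell/sqlplus implementation. It is an illustration that assumes each side can deliver its rows as tuples of the form `(pk, col1, col2, ...)` with an integer primary key and identical Python types on both sides. The function names are invented for the example.

```python
import hashlib
from collections import defaultdict


def band_checksums(rows, band_size=1000):
    """Return {band_start: checksum} for rows shaped like (pk, col1, col2, ...)."""
    bands = defaultdict(hashlib.md5)
    # Sort by primary key so both sides feed rows to the hash in the same order.
    for row in sorted(rows, key=lambda r: r[0]):
        band_start = (row[0] // band_size) * band_size
        bands[band_start].update(repr(row).encode("utf-8"))
    return {start: digest.hexdigest() for start, digest in bands.items()}


def differing_bands(source_rows, target_rows, band_size=1000):
    """List the band starts whose checksums differ or exist on one side only."""
    src = band_checksums(source_rows, band_size)
    tgt = band_checksums(target_rows, band_size)
    return sorted(b for b in src.keys() | tgt.keys() if src.get(b) != tgt.get(b))


source = [(1, "a"), (999, "b"), (1000, "c"), (2500, "d")]
target = [(1, "a"), (999, "b"), (1000, "X"), (2500, "d")]
suspect = differing_bands(source, target)
```

Only the bands reported here need a row-by-row look, and since bands are independent of each other they can be computed in parallel.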
On one project there wasn't enough disk space to do these reports, so I wrote a Java program that queried the two databases side by side, using blocking queues to fetch and compare rowsets. Being in memory meant this was super fast.
For the "deep dive" we looked at the details for key tables, or for tables that reported a checksum difference.
For the user reports, the users would specify what they wanted to see, and we wrote the reports accordingly.
On the last project, the only discrepancies found were caused by character set conversion issues (people names with accents weren't handled correctly).
On projects where the overall dataset was smaller we extracted the data to XML files and wrote a Java tool to process pairs and report differences.
The SAP/Sybase rs_subcmp tool is pretty powerful and also pretty hard to use. For details see:
You have to pass it key field information, but once you do that, it can retry/restart the compare streams after transient differences. Pretty fancy.
rs_subcmp expects to work on a Sybase data source. So to compare against Oracle, you'd probably have to set up one of those Sybase-to-Oracle gateway products ($$$$$).
Could you install the Oracle ODBC drivers and configure them to allow Sybase clients to access Oracle? I'm guessing not (but that's outside the range of my experience).
Note the "-h" option for rs_subcmp. The docs just say it runs a "fast comparison", but what it's actually doing is running queries using the hashbytes() function. Something like:
select keyfield1,keyfield2, hashbytes("Md5",datacol1,datacol2,datacol3)
from mytable
So this sort of query might be good for the "summary view" type comparison discussed above (if the Oracle STANDARD_HASH() function output matches up with the Sybase hashbytes() function (again, outside my experience))
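If it turns out that the two databases' hash functions do not produce matching output, one fallback is to fetch the rows and hash them on the client, so that the same code hashes both sides. The sketch below assumes rows arrive as `(key, col1, col2, ...)` tuples. The function names and the field separator are choices made for the example, and this is not how rs_subcmp works internally.

```python
import hashlib


def row_hashes(rows):
    """Map each primary key to an MD5 digest of the remaining columns."""
    hashes = {}
    for key, *values in rows:
        # Join with a separator unlikely to occur in data. Note that NULL and
        # the empty string hash identically here, which may or may not be wanted.
        payload = "\x1f".join("" if v is None else str(v) for v in values)
        hashes[key] = hashlib.md5(payload.encode("utf-8")).hexdigest()
    return hashes


def compare_rows(source_rows, target_rows):
    """Report keys that are missing, extra, or different on the target."""
    src, tgt = row_hashes(source_rows), row_hashes(target_rows)
    return {
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        "extra_in_target": sorted(tgt.keys() - src.keys()),
        "changed": sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k]),
    }


source = [(1, "José", 10), (2, "Ann", 20), (3, "Bob", 30)]
target = [(1, "Jos?", 10), (2, "Ann", 20), (4, "Cy", 40)]
report = compare_rows(source, target)
```

The `changed` entry in the sample data mimics the kind of character-set conversion damage mentioned earlier in this thread.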
Note, as of ASE 16, there was a bug with the hash() & hashbytes() functions running the Md5 hash option against large varbinary columns where they could use up all procedure cache, potentially crashing the server (CR 811073)

Big Data transfer between different systems

We have different sets of data in different systems such as Hadoop, Cassandra and MongoDB, but our analytics team wants to get stitched data from across these systems. For example, customer information with demographics will be in one system and their transactions in another. Analysts should be able to run queries such as: for US users, what was the volume of transactions? We need to develop an application that provides an easy way to interact with the different systems. What is the best way to do this?
Another requirement:
If we want to give them a custom workspace in a system like MongoDB, so that they can easily play with it, what is the best strategy to pull data from one system to another on demand?
Any pointer or common architecture used to solve this kind of problem will be really helpful.
I see two questions here:
How can I consolidate data from different systems into one system?
How can I create some data in Mongo for people to experiment with?
Here we go ... =)
I would pick one system and target that for consolidation. In other words, between Hadoop, Cassandra and MongoDB, which one does your team have the most experience with? Which one do you find easiest to query with? Which one do you have set up to scale well?
Each one has pros and cons to scale, storage and queryability.
I would pick one and then pump all data to that system. At a recent job, that ended up being MongoDB. It was easy to move data to Mongo and it had by far the best query language. It also had a great community and setting up nodes was easier than Hadoop, etc.
Once you have solved (1), you can trim your data set and create a scaled down sandbox for people to run ad-hoc queries against. That would be my approach. You don't want to support the entire data set, because it would likely be too expensive and complicated.
If you were doing this in a relational database, I would say just run a
select top 1000 * from [table]
query on each table and use that data for people to play with.
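One caveat with sampling each table independently is that the sampled transactions may point at customers that were not sampled. A minimal sketch of a sampling step that keeps the two sets consistent is shown below. The record layout (dicts with a `customer_id` field) and the function name are assumptions for the example.

```python
import random


def build_sandbox(customers, transactions, sample_size, seed=0):
    """Sample customers and keep only the transactions that belong to them."""
    rng = random.Random(seed)  # fixed seed so the sandbox is reproducible
    chosen = rng.sample(customers, min(sample_size, len(customers)))
    ids = {c["customer_id"] for c in chosen}
    kept = [t for t in transactions if t["customer_id"] in ids]
    return chosen, kept


customers = [{"customer_id": i, "country": "US"} for i in range(100)]
transactions = [{"customer_id": i % 100, "amount": i} for i in range(500)]
sample_customers, sample_transactions = build_sandbox(customers, transactions, 10)
```

The two returned lists can then be loaded into whichever system hosts the sandbox.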

Free data warehouse - Infobright, Hadoop/Hive or what?

I need to store a large number of small data objects (millions of rows per month). Once they're saved they won't change. I need to:
store them securely
use them for analysis (mostly time-oriented)
retrieve some raw data occasionally
It would be nice if it could be used with JasperReports or BIRT
My first shot was Infobright Community - just a column-oriented, read-only storage mechanism for MySQL.
On the other hand, people say that a NoSQL approach could be better. Hadoop+Hive looks promising, but the documentation looks poor and the version number is less than 1.0.
I heard about Hypertable, Pentaho, MongoDB ....
Do you have any recommendations ?
(Yes, I found some topics here, but they were from a year or two ago.)
Other solutions : MonetDB, InfiniDB, LucidDB - what do you think?
I'm having the same problem here and did some research. There are three types of storage for BI:
Column oriented. Free and known: MonetDB, LucidDB, Infobright, InfiniDB
Distributed: hTable, Cassandra (also column oriented, theoretically)
Document oriented: MongoDB, CouchDB
The answer depends on what you really need :
If your millions of rows are loaded at once (nightly batch or so), InfiniDB or another column-oriented DB is the best choice. They have great performance and are "BI oriented". http://www.d1solutions.ch/papers/d1_2010_hauenstein_real_life_performance_database.pdf
And they won't require a setup of "nodes", "sharding" and other stuff that comes with distributed/"NoSQL" DBs.
If the rows are added in real time, then column-oriented DBs are a bad fit. You can either have two separate DBs (that's my choice: one NoSQL DB that the front end feeds in real time and that serves real-time stats, and one column-oriented DB for BI), or turn towards something that mixes column orientation (for read requests) and distribution (for writes), like Cassandra.
Document-oriented DBs are not suited for BI. They are more useful for CRM/CMS use cases where you need frequent access to a particular row.
As for the exact choice inside a category, I'm still undecided. Cassandra among the distributed stores, and Monet or InfiniDB among the column-oriented DBs, are the leaders. Monet is reported to have problems loading very big tables because it keeps its indexes in memory.
You could also consider GridSQL. Even for a single server, you can create multiple logical "nodes" to utilize multiple cores when processing queries.
GridSQL uses PostgreSQL, so you can also take advantage of partitioning tables into subtables to evaluate queries faster. You mentioned the data is time-oriented, so that would be a good candidate for creating subtables.
If you're looking for compatibility with reporting tools, something based on MySQL may be your best choice. As for what will work for you, Infobright may work. There are several other solutions as well, however you may want also to look at plain-old MySQL and the Archive table. Each record is compressed and stored and, IIRC, it's designed for your type of workload, however I think Infobright is supposed to get better compression. I haven't really used either, so I'm not sure which will work best for you.
As for the key-value stores (E.g. NoSQL), yes, they can work as well and there are plenty of alternatives out there. I know CouchDB has "views", but I haven't had the opportunity to use any, so I don't know how well any of them work.
My only concern with your data set is that since you mentioned time, you may want to ensure that whatever solution you use will allow you to archive data past a certain time. It's a common data warehouse practice to only keep N months of data online and archive the rest. This is where partitioning, as implemented in an RDBMS, comes in very useful.
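To make the "keep N months online" rule concrete, here is a small sketch that decides which monthly partitions fall outside the retention window. The `YYYY_MM` partition naming and the function names are assumptions for the example. How a partition is actually detached or archived depends on the database product.

```python
from datetime import date


def month_key(d):
    """Partition key for a date, e.g. date(2024, 3, 9) -> '2024_03'."""
    return f"{d.year:04d}_{d.month:02d}"


def partitions_to_archive(partition_keys, today, keep_months):
    """Return the monthly partitions older than the newest keep_months months."""
    # Count months on a single axis so year boundaries need no special casing.
    current = today.year * 12 + (today.month - 1)
    cutoff = current - keep_months + 1  # oldest month that stays online
    old = []
    for key in partition_keys:
        year, month = (int(part) for part in key.split("_"))
        if year * 12 + (month - 1) < cutoff:
            old.append(key)
    return sorted(old)


existing = ["2023_11", "2023_12", "2024_01", "2024_02", "2024_03"]
to_archive = partitions_to_archive(existing, date(2024, 3, 15), keep_months=3)
```

With one partition (or subtable) per month, archiving becomes a matter of moving whole partitions rather than deleting individual rows.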

Oracle streams and denormalization

I intend to use Oracle Streams for replication from Source to Target. The Target will be used mainly to run Reports. Earlier, all the reports used to run on the Source itself. Therefore, this arrangement gives better performance as all report queries are directed to a dedicated Target.
I would also like to denormalize the tables on the Target to achieve better report performance. Can denormalization be done in conjunction with Streams replication? I know that Oracle Streams allows us to write our own dequeue process. But is there a simple "GUI"-based way to achieve denormalization on the fly, as and when Streams replicates the data? Any pointers would be very helpful.
I think the cleanest way to denormalize would be to leave the Streams replication intact (with 1->1 mappings of the tables) and create materialized views on the target tables that handle the transformations you need.
I think GUI interfaces to these types of transformations get cumbersome quickly as the logic gets more complicated, but if you really want a GUI solution you can look at Oracle Warehouse Builder. Once the GUI-driven design is complete within OWB, you can generate PL/SQL packages to perform the ETL.
