I work on a product that imports data from a mainframe using SSIS via flat file. The SSIS packages use a stage database to transform flat file data and then call stored procedures in the ODS to load the transformed data. There is a potential plan to route all ETL data through a .NET service layer (instead of directly to the ODS via stored procedures) to centralize business rules/activity, etc. I'm looking for input on this approach and dissenting opinions.
Sounds fine; you're turning basic ETL into ETVL, adding a "validate" step. Normally this is considered part of the "transform" stage, but I prefer to keep that stage purer when I conceptualize an architecture like this; transform is turning the raw fields which were pulled out and chopped up in the extract stage into objects of my domain model. Verifying that those objects are in a valid state for the system is validation.
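To make the ETVL split concrete, here is a minimal sketch in Python (purely illustrative; the Customer fields and the ods.save call are hypothetical stand-ins, not the actual SSIS/.NET pipeline):

    from dataclasses import dataclass

    @dataclass
    class Customer:                       # domain object produced by the transform stage
        customer_id: str
        name: str
        balance: float

    def extract(flat_file_lines):
        """Extract: pull the raw fields out of each flat-file record."""
        for line in flat_file_lines:
            yield line.rstrip("\n").split("|")

    def transform(raw_rows):
        """Transform: turn raw fields into domain objects."""
        for fields in raw_rows:
            yield Customer(customer_id=fields[0], name=fields[1], balance=float(fields[2]))

    def validate(customers):
        """Validate: keep only objects in a valid state for the system."""
        for c in customers:
            if c.customer_id and c.balance >= 0:   # invalid records would go to a reject queue
                yield c

    def load(customers, ods):
        """Load: hand validated objects to the service layer / ODS."""
        for c in customers:
            ods.save(c)                   # hypothetical service call

Each stage stays single-purpose, and the validate step is where the centralized business rules would live.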
I need some clarity on how data will flow from source system to target system in a typical ETL data warehouse architecture.
For example: the source system, target system, and ETL server are on three different networks, and some transformations and logic are applied in the ETL. In this case, does data flow source -> ETL server -> target server, or source -> target with transformations applied on the fly in between, so that data never flows through the ETL server?
In most situations (I can't think of an exception, but there must be some), the data moves from the source system to the ETL server and then to the target server. Transformations take place on the ETL server, which can often cause a bottleneck if that machine is under-powered or light on memory. If that turns out to be the case, an ELT approach may become necessary. Most ETL tools can easily accommodate that approach, though.
Anything more specific will depend on the specific ETL product you're using and your server architecture.
As you said, there are two different methods of ETL: pipeline and multistage.
1- In the pipeline method there is no ETL server or staging area; the transformations (including data cleansing, validation, format revision, etc.) are applied at the same time as the extract step, and the transformed data is then loaded onto the target server. In this approach, you may run the transformation program on either the source or the target server.
2- In the multistage method you have at least 3 servers (or distinct spaces): source, staging, and target. On a database, for example, these can be 3 separate database servers or 3 schemas in one database. Either way, the transformation program should run in the staging area. The staging area is the space where you write the extracted data and then apply transformations to it. In this area you may have many stages: for example, you can write the extracted data to stg1 tables or files; after that, transformation_step_1 is applied to the stg1 data and the transformed data is written to stg2 tables or files.
Depending on the application, you may need to apply transformation_step_2 to the stg2 data and write the transformed data to stg3 tables or files. This process can continue until all transformations have been applied, so you may call the process multistage ETL.
I prefer the multistage method because a program written this way is easier to debug and won't exhaust RAM. One of its disadvantages is heavy use of storage.
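As a rough sketch of the multistage idea (Python with SQLite standing in for the staging database; the stg1/stg2 table names follow the example above and the columns are made up):

    import sqlite3

    # Staging database standing in for a separate staging server or schema.
    stg = sqlite3.connect("staging.db")
    stg.execute("CREATE TABLE IF NOT EXISTS stg1 (customer_id TEXT, amount TEXT)")
    stg.execute("CREATE TABLE IF NOT EXISTS stg2 (customer_id TEXT, amount REAL)")

    def extract(source_rows):
        """Extract: write raw source rows into stg1 as-is."""
        stg.executemany("INSERT INTO stg1 VALUES (?, ?)", source_rows)
        stg.commit()

    def transformation_step_1():
        """Cleanse/convert stg1 data and write the result to stg2."""
        rows = stg.execute("SELECT customer_id, amount FROM stg1").fetchall()
        cleaned = [(cid.strip(), float(amt)) for cid, amt in rows if amt]
        stg.executemany("INSERT INTO stg2 VALUES (?, ?)", cleaned)
        stg.commit()

    def load(target):
        """Load: read the final stage and push it to the target server."""
        for row in stg.execute("SELECT customer_id, amount FROM stg2"):
            target.insert(row)            # hypothetical target-side call

Because each stage is persisted, you can inspect stg1/stg2 when debugging and re-run a single step without re-extracting.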
I know that my overall problem is generally approached with one of the more common solutions, such as a join data set or a sub-table/sub-report. I have looked at those and I am not sure they will work effectively here.
Background:
The JDBC data source has local data which includes a series of IDs that reference records in a master data repository interfaced via a web service. This is where the need for a scripted data source arises. The data can be filtered on attributes within the local JDBC data and/or the extended data from the web service. The complication is that my only interface to the web service is the ID argument.
Ideal Solution:
Aside from creating a reporting table or other truly desirable scenarios, I am looking at creating a unified data source through a single scripted data source that handles all of the complexity. This leaves report generation and parameter creation a bit cleaner, hopefully. The idea is to leverage the JDBC query as well as the web service queries in the scripted data source, do the filtering and joins there, and create that single unified view.
I tried using the following code as a reference for using the existing JDBC connection in the BIRT report definition to execute the query. However, I think my breakdown of what should go in open vs. fetch (given that the code came from beforeFactory and served a completely different purpose) may be giving me errors... the truth is I see no errors; it just returns 0 records.
a link
I have also found a code snippet to dynamically load a JDBC connection, but that seems a bit obtuse and a ton of overhead for what I need to do. a link
In short: how in all that is holy do you simply run a query against a database within a scripted data source, if you wanted to? The merit of doing so is another issue, but technically, how?
Thanks in Advance!
So I was thinking... Imagine you have to write a program that would represent a schedule of a whole college.
That schedule has several dimensions (e.g.):
time
location
individual(s) attending it
lecturer(s)
subject
You would have to be able to display the schedule from several standpoints:
everything held in one location in a certain timeframe
everything attended by an individual in a certain timeframe
everything lectured by a certain lecturer in a certain timeframe
etc.
How would you save such data, and yet keep the ability to view it from different angles?
The only way I could think of was to save it in every form you might need it in:
E.g. you have a folder "students" in which each student has a file containing when and why and where he has to be. However, you also have a folder "locations" in which each location has a file containing who has to be there, when, and why. The more angles you have, the more the size-per-info ratio increases.
But that seems highly inefficient, space-wise.
Is there any other way?
My knowledge of JavaScript is zero, but I wonder if such things would be possible with it, even in this space-inefficient form.
If not that, I wonder if it would work in any other standard language (C++, C#, Java, etc.), primarily Java...
EDIT: Could this be done by using MySQL database?
Basically, you are trying to first store data and then present it under different views.
SQL databases were made exactly for that: on one side you build a schema and instantiate it in a database to store your data (the language for this is the Data Definition Language, DDL); then you make requests on it with the query language (SQL), which is what gives you your "views". There are even "view" objects in SQL databases for building these views inside the database (rather than having to keep the code of the request in the user code).
MySQL can certainly do that; note that it is also possible to compile an SQL engine to JavaScript (SQLite, for example) and use local web storage to store the data.
There is another aspect to your question: optimization of the queries. While SQL can do most of the request work for your views, it is sometimes preferable to create actual copies of the request results in so-called "datamarts" (this is called de-normalizing a request), so that the hard work of selecting or computing aggregates/group functions and so on is done once per period of time (imagine that a specific view changes only on Monday), and requesters then just have to read these results. It is important in this case to separate, at least semantically, primary data from secondary data (and for performance/user-rights reasons, physical separation is often a good idea).
Note that since you cited MySQL I wrote about SQL, but almost any database technology (hierarchical, object-oriented, XML...) could do what you are trying to do, as long as the particular implementation you use is flexible enough for your data and requests.
So in short:
I would use a SQL database to store the data
make appropriate views / requests
if I need huge request performance, make appropriate de-normalized data available
the language is not important there, any will do
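A minimal sketch of the "one schema, many views" idea (SQLite via Python here; the same schema-plus-views approach carries over to MySQL, and the columns are just an illustration):

    import sqlite3

    db = sqlite3.connect("college.db")
    db.executescript("""
    -- Every scheduled event is stored exactly once; the "angles" are views.
    CREATE TABLE IF NOT EXISTS schedule (
        subject   TEXT,
        lecturer  TEXT,
        student   TEXT,
        location  TEXT,
        starts_at TEXT,
        ends_at   TEXT
    );

    -- Everything held in one location.
    CREATE VIEW IF NOT EXISTS by_location AS
        SELECT location, starts_at, ends_at, subject, lecturer, student FROM schedule;

    -- Everything attended by an individual.
    CREATE VIEW IF NOT EXISTS by_student AS
        SELECT student, starts_at, ends_at, subject, lecturer, location FROM schedule;
    """)

    # Query one angle with a timeframe filter; the data itself is stored only once.
    rows = db.execute(
        "SELECT * FROM by_location WHERE location = ? AND starts_at >= ?",
        ("Room 101", "2024-01-08"),
    ).fetchall()

A real schema would normally split students, lecturers, and locations into their own tables, but the point stands: the views give you the different angles without duplicating the data.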
I am writing an ETL (in Python with a MongoDB backend) and was wondering: what kind of standard functions and tools should an ETL have to be called an ETL?
This ETL will be as general-purpose as possible, with a scriptable and modular approach. Mostly it will be used to keep different databases in sync, and to import/export datasets in different formats (XML and CSV). I don't need any multidimensional tools, but it is possible they will be needed later.
Let's think of the ETL use cases for a moment.
Extract.
Read databases through a generic DB-API adapter.
Read flat files through a similar adapter.
Read spreadsheets through a similar adapter.
Cleanse.
Arbitrary rules
Filter and reject
Replace
Add columns of data
Profile Data.
Statistical frequency tables.
Transform (see cleanse, they're two use cases with the same implementation)
Do dimensional conformance lookups.
Replace values, or add values.
Aggregate.
At any point in the pipeline
Load.
Or prepare a flat-file and run the DB product's loader.
Further, there are some additional requirements that aren't single use cases.
Each individual operation has to be a separate process that can be connected in a Unix pipeline, with individual records flowing from process to process (see the sketch at the end of this answer). This uses all the CPU resources.
You need some kind of time-based scheduler for places that have trouble reasoning out their ETL preconditions.
You need an event-based schedule for places that can figure out the preconditions for ETL processing steps.
Note. Since ETL is I/O bound, multiple threads do you little good. Since each process runs for a long time -- especially if you have thousands of rows of data to process -- the overhead of "heavyweight" processes doesn't hurt.
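For instance, here is what the "separate process in a Unix pipeline" requirement might look like for a single cleanse stage (plain Python reading records on stdin and writing them on stdout; the rule and the script names in the chaining example are only an illustration):

    #!/usr/bin/env python
    """One ETL stage: reads a CSV record per line on stdin, applies a
    cleansing rule, writes the surviving records on stdout.
    Chain stages in the shell:  extract.py | cleanse.py | load.py
    """
    import csv
    import sys

    def main():
        reader = csv.reader(sys.stdin)
        writer = csv.writer(sys.stdout)
        for row in reader:
            # Arbitrary rule: reject rows with an empty key, trim the rest.
            if not row or not row[0].strip():
                continue
            writer.writerow([field.strip() for field in row])

    if __name__ == "__main__":
        main()

Each stage is a small long-running process, so the operating system overlaps their I/O and CPU for you.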
Here's a random list, in no particular order:
Connect to a wide range of sources, including all the major relational databases.
Handle non-relational data sources like text files, Excel, XML, etc.
Allow multiple sources to be mapped into a single target.
Provide a tool to help map from source to target fields.
Offer a framework for injecting transformations at will.
Programmable API for writing complex transformations.
Optimize load process for speed.
Automatic / heuristic mapping of column names (see the sketch below). E.g. simple string mappings:
DB1: customerId
DB2: customer_id
I find that a lot of the work I have done in DTS / SSIS could have been automatically generated.
Not necessarily "required functionality", but it would keep a lot of your users very happy indeed.
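A naive sketch of that heuristic column-name mapping (Python; the normalization rule is just one possible choice):

    import re

    def normalize(name):
        """customerId, customer_id and CUSTOMER-ID all normalize to 'customerid'."""
        return re.sub(r"[^a-z0-9]", "", name.lower())

    def auto_map(source_columns, target_columns):
        """Propose source -> target mappings wherever the normalized names match."""
        targets = {normalize(c): c for c in target_columns}
        return {s: targets[normalize(s)] for s in source_columns if normalize(s) in targets}

    print(auto_map(["customerId", "orderTotal"], ["customer_id", "order_total", "notes"]))
    # {'customerId': 'customer_id', 'orderTotal': 'order_total'}

Anything the heuristic can't match would still fall back to the manual mapping tool.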
I am building a DataAccess layer for a DB. What data structure is recommended for passing and returning a collection?
I use a list of data access objects mapped to the db tables.
I'm not sure what language you're using, but in general, there are tradeoffs of simplicity vs extensibility.
If you return the DataSet directly, you have now coupled yourself to database specific classes. This leaves little room for extension - what if you allow access to files or to other types of data sources? But, it is also very simple. This is the recordset pattern and C#/VB provide a lot of built-in support for this. The GUI layer can access the recordset and easily manipulate the data. This works well for simple applications.
On the other hand, you can wrap the datasets in a custom object and provide gateway methods (see the Gateway pattern, http://martinfowler.com/eaaCatalog/gateway.html). This method is more complex, but provides a lot more extensibility. In a larger application, when you need to separate the business logic, data logic, and GUI logic, this is a more robust way to go.
For larger enterprise applications, you can look into using Object Relational Mapping tools (ORMs). They help to automatically map Java objects to database tables and hide a lot of the painful SQL details. Frameworks such as Spring provide excellent support for ORMs.
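A rough sketch of the gateway idea (Python rather than C#/VB, against a hypothetical customers table with sqlite3-style placeholders; the point is that callers never touch database classes directly):

    class CustomerGateway:
        """Wraps the connection/dataset so the GUI and business layers see plain data."""

        def __init__(self, connection):
            self._conn = connection      # e.g. a sqlite3 connection

        def find_by_id(self, customer_id):
            cur = self._conn.cursor()
            cur.execute("SELECT id, name, balance FROM customers WHERE id = ?", (customer_id,))
            row = cur.fetchone()
            return None if row is None else {"id": row[0], "name": row[1], "balance": row[2]}

        def update_balance(self, customer_id, new_balance):
            cur = self._conn.cursor()
            cur.execute("UPDATE customers SET balance = ? WHERE id = ?", (new_balance, customer_id))
            self._conn.commit()

Swapping the backing store later only means changing the gateway, not its callers.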
I tend to use arrays of objects, so that I can disconnect the DAO from the business logic.
You can store the data in the DAO as a dataset, for example, and give callers an easy way to add to it before doing an update, so they can pass in the information for modification operations and then, when they want to commit the changes, do it all in one shot.
I prefer that the user can't add/modify the structure themselves, as it makes it harder to determine what must be changed in the database.
By initially returning an array they can then display what is in the database.
Then, as the presentation layer makes changes, the DAO can be updated by the controller. By having a loose coupling the entire system becomes more flexible, as you can change the DAO from a dataset to something else, and the rest of the application doesn't care.
There are two choices that are the most generic.
The first way to look at a ResultSet is as a List of Maps, where each Map represents a row in the ResultSet. The keys are the columns listed in the SELECT clause; the values are the database values.
The second way to look at a ResultSet is as a Map of Lists, where each List represents a column in the ResultSet. The Map keys are the columns listed in the SELECT clause; the values are the Lists of database values.
If you don't want to do full-blown ORM, these can carry you a long way.
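In Python terms (the Java analogue would be List<Map<String, Object>> and Map<String, List<Object>>), the two shapes look like this; the sample columns are made up:

    # Row-oriented: a list of dicts, one dict per row of the result set.
    rows = [
        {"customer_id": 1, "name": "Ada",   "balance": 12.50},
        {"customer_id": 2, "name": "Brian", "balance": 99.00},
    ]

    # Column-oriented: a dict of lists, one list per column of the result set.
    columns = {
        "customer_id": [1, 2],
        "name":        ["Ada", "Brian"],
        "balance":     [12.50, 99.00],
    }

    def rows_to_columns(rows):
        """Pivot the row view into the column view."""
        return {key: [r[key] for r in rows] for key in rows[0]} if rows else {}

    def columns_to_rows(columns):
        """Pivot the column view back into the row view."""
        return [dict(zip(columns, values)) for values in zip(*columns.values())]

The row view is handy for record-at-a-time processing; the column view is handy for per-column operations like profiling or bulk binds.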