Get data's source in kettle - etl

When I use Kettle, I was wondering how to get a table column's source column. For example, after I have merged two tables into one table based on the primary key, then given any column in the output table, I would like to determine which table it belongs to and get the original column name in the original table. Thank you for helping, and sorry for my poor English.
http://i.stack.imgur.com/xoR0s.png
When I am given any field in table3 (suppose a field named A in table3), I would like to know where it comes from without using the graphical view (from Java code or some other way): the original table name (here input1 or input2) and the original column name (maybe B in input1, which appears as A in table3). Also, I use MySQL.

There are a couple of ways to do this:
1) Manually. If you right-click on the output step and choose Show Output fields (or whatever it's called), you will see the "origin step" for each of the outgoing fields. You can do the same for input fields. Then you can trace them back to those origin steps, and repeat the process of viewing the input fields at those steps, and seeing those fields' origins, and so on. This is probably not what you're looking for.
2) With code. Prior to 6.0, you'd need to programmatically perform the same operations as are listed in option 1 above. In 6.0 there is the Data Lineage capability, which offers the LineageClient API that can find the origin fields for the specified output fields. For more information see my blog post describing the Data Lineage capability. Also I put a Gremlin Console in the PDI Marketplace, to make the use of LineageClient easier (and you can visually see the lineage graph too).
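To make the "trace back step by step" idea concrete, here is a minimal sketch in Python. It does not use the Kettle or LineageClient API at all: the `STEPS` dictionary, its layout, and the `trace_origin` helper are hypothetical names invented for illustration, and they only assume that each step knows its upstream steps and any field renames it applies.

```python
# Hypothetical in-memory model of a transformation, NOT the Kettle API.
# "renames" maps an output field name to the input field name it came from;
# fields that are not listed pass through unchanged.
STEPS = {
    "input1": {"upstream": [], "fields": ["id", "B"], "renames": {}},
    "input2": {"upstream": [], "fields": ["id", "C"], "renames": {}},
    "merge": {"upstream": ["input1", "input2"], "renames": {"A": "B"}},
    "table3": {"upstream": ["merge"], "renames": {}},
}


def trace_origin(step, field):
    """Return (input_step, original_field) pairs for `field` as seen at `step`."""
    info = STEPS[step]
    if not info["upstream"]:
        # An input step is an origin only if it actually provides the field.
        return [(step, field)] if field in info["fields"] else []
    # Undo this step's rename, then keep walking towards the inputs.
    source_field = info["renames"].get(field, field)
    origins = []
    for parent in info["upstream"]:
        origins.extend(trace_origin(parent, source_field))
    return origins


print(trace_origin("table3", "A"))
```

With this toy model, field A in table3 resolves to column B of input1, which mirrors the example in the question. In a real transformation the step metadata would have to come from Kettle itself rather than from a hand-written dictionary.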

How to make groups in an input and select a specific row in each of them in Talend?

I am working on a Talend transformation process (we are using Talend 6.4), and I don't know how to implement the current requirement.
I have an input consisting of:
Two columns that are my group keys (Account and Product), but are not unique (the same Account x Product couple can happen in multiple rows)
A criterion column (Contract end date), which will help me decide which row I want to keep for each group
Some "tail" data that need to be passed to the following step of the processing (the contract number)
The rule to implement is:
Keep only one record per group
The selected record must be one with no end date or, if all records have an end date, the one with the latest end date
The selected record can be random in case there is a tie
See the transformation applying those rules on some dummy data:
My first thought was to do the following:
sort by Account, Product, End_date (nulls first)
"select first" in each group
but I am not skilled enough to know whether the second transformation exists in Talend.
Regards,
Pierre
Very interesting Talend question.
You need to create something like this job.
Here is a link to the zip file to import into your Talend.
The answer from @MBDIA seems to work; however, I would like to share what we did to fulfill our requirement.
See our Talend process here:
The first tMap (tMap_3) acts like a tReplicate and a tMap, and sends:
in the upper branch only the Account and Product references, that are then deduplicated by the tAggregateRow_1.
in the lower branch all data and computed fields that enable us to take care of the case where the date is missing (instead of defaulting to 31/12/9999, we compute a flag (0 or 1) that we use in the sort step afterwards).
In the second part of the process, we first apply the sort to the whole data on Account, Product, Empty date flag (computed before), End date (desc) and use a second tMap to make a join on both branches (on Account x Product), only keeping First Match in order to keep the first record as per our requirement.
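Outside Talend, the same "sort, then keep the first match per group" rule can be sketched in a few lines of Python. This is only an illustration of the logic, not the Talend job: the field names (`account`, `product`, `end_date`, `contract`) and the sample rows are assumptions made up for the example.

```python
from datetime import date


def select_one_per_group(rows):
    """Keep one row per (account, product): no end date wins, else latest end date."""

    def sort_key(row):
        # Empty-date flag: 0 when the end date is missing, so those rows sort first.
        has_end_date = 0 if row["end_date"] is None else 1
        # Negate the ordinal so that later dates sort first (descending order).
        end_rank = 0 if row["end_date"] is None else -row["end_date"].toordinal()
        return (row["account"], row["product"], has_end_date, end_rank)

    selected = {}
    for row in sorted(rows, key=sort_key):
        # setdefault keeps only the first row seen per group ("first match").
        selected.setdefault((row["account"], row["product"]), row)
    return list(selected.values())


# Made-up sample data for illustration only.
rows = [
    {"account": "A1", "product": "P1", "end_date": date(2020, 1, 31), "contract": "C1"},
    {"account": "A1", "product": "P1", "end_date": None, "contract": "C2"},
    {"account": "A2", "product": "P1", "end_date": date(2019, 6, 30), "contract": "C3"},
    {"account": "A2", "product": "P1", "end_date": date(2021, 6, 30), "contract": "C4"},
]

for row in select_one_per_group(rows):
    print(row["account"], row["product"], row["contract"])
```

For the sample rows, group A1 x P1 keeps contract C2 (no end date) and group A2 x P1 keeps contract C4 (latest end date). Ties are resolved by whichever row the stable sort happens to place first, which is allowed by the "can be random" rule.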

Oracle - build dimension from a file based data source

I'm trying to build a star schema in Oracle 12c. In my case the data source is not a relational database but a single Excel/CSV file that is populated via a Google Form, which means I don't have any sort of reference from a source system, such as auto-increment keys/IDs. What would be the best approach to building a star schema under these conditions?
File row sample:
<submitted timestamp>,<submitted by user>,<region>,<country>,<branch>,<branch location>,<branch area>,<branch type>,<branch name>,<branch private? yes/no value>,<the following would be all "fact" values (measurements)>,...,...,...
If I wanted to build a "branch" dimension, how would I handle updates/inserts after the first load into the dimension table?
Thought solution so far:
I had thought of making a concatenated string "key" with the branch values, which would make it unique (underscore would be the "glue" to concatenate the values), eg:
<region>_<country>_<branch>_<branch location> as branch_key
I would insert all the distinct branches into a staging table, including the branch_key column for each of them. Then, when loading into the dimension, I could check which keys do not yet exist in my dimension table and insert those. As for updates, I'm a bit stuck on how to handle them. I had thought of having another file that maps which branches are active, with an expiration date column. Basically, I am trying to simulate what I could do if the data were in a database instead of CSV files.
This is all I can think of so far. Do you have any other recommendations/ideas on how to implement this? Take into consideration that the data source cannot be changed: I have to read these CSV files, since the data is not stored anywhere else.
Thank you.
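The staging comparison described in the question can be sketched as follows. This is a plain Python illustration of the idea, not Oracle code: the header names (`region`, `country`, `branch`, `branch_location`, `amount`), the sample rows, and the helper names are all assumptions made up for the example, and the real file may not have a header row at all.

```python
import csv
import io

# Hypothetical column names; the key is built as in the question:
# <region>_<country>_<branch>_<branch location>
KEY_COLUMNS = ["region", "country", "branch", "branch_location"]


def branch_key(row):
    """Concatenate the key columns with an underscore as the glue."""
    return "_".join(row[col].strip() for col in KEY_COLUMNS)


def new_branches(csv_text, existing_keys):
    """Return distinct branches from the file whose key is not in the dimension yet."""
    staged = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Staging step: keep one row per distinct branch_key.
        staged.setdefault(branch_key(row), row)
    return {key: row for key, row in staged.items() if key not in existing_keys}


# Made-up sample data for illustration only.
sample = """region,country,branch,branch_location,amount
EMEA,FR,Paris01,Paris,10
EMEA,FR,Paris01,Paris,12
APAC,JP,Tokyo01,Tokyo,7
"""
existing = {"EMEA_FR_Paris01_Paris"}  # keys already loaded into the dimension

print(sorted(new_branches(sample, existing)))
```

One thing to watch with a concatenated key: if any value can itself contain an underscore, two different branches could produce the same key, so a delimiter that cannot occur in the data (or a hash of the separate columns) may be safer.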

Is it possible to reverse a column transformation in Spotfire, and if not, what are the alternatives?

I've made the mistake of using the 'Calculate and Replace Column' feature to replace the wrong column, and realized after the fact. The column I replaced corresponds to last names and is important. I would like to retrieve this column but maintain my other 15 or so data transformations. Ideally, I would like to remove this transformation, but I've come up empty so far. Here's what I've tried:
1) I tried adding the 'last name' column again from the same external source, using >Insert >Columns... I also tried renaming this column to avoid the data transformation. Unfortunately, this resulted in an entirely empty column, so either it did not successfully match to the table or it was affected by the transformation.
2) I checked the source information, and found exactly the 3-4 lines that I wish were not there. I thought it might be possible to edit this but haven't found a way. This seems like it would be the easiest.
3) Another idea I had was to replace the data table with the same source, and repeat all of the transformations from the replace data table dialogue (excluding the bad one). This is my next plan of attack, but I figured I would come on here to see if there's an easier way first.
Thanks in advance!
Good news for you, @jeremyVollen!
It is possible to 'edit' your transformation, per Tibco article 44098.
Resolution: If there is more than one transformation on a data table and you need to edit any of those transformations, follow the steps below:
Go To Edit >> Data Table Properties.
Select the desired data table inside which the transformation has been added and click on Refresh Data > With Prompt.
A new window will pop up which will allow you to make the desired changes in each of the transformations.
Unfortunately it is NOT possible to reverse data table transformations.
It IS possible to undo the transformations with Edit>>Undo or CTRL+Z, but that's as far as it goes.
My strategy for dealing with this is (in accordance with your #3) to visit Edit>>Data Table Properties, select the table I'm interested in, select Source Information, then copy the contents of the textarea and paste it into Notepad. Then I'll File>>Replace Data Table and start over from the beginning, keeping Notepad open so I don't miss any steps.
I realize it's not ideal, but there is unfortunately not another way.

Reading XML-files with StAX / Kettle (Pentaho)

I'm doing an ETL process with Pentaho (Spoon / Kettle) where I'd like to read an XML file and store element values in a DB.
This works just fine with the "Get data from XML" component, but the XML file is quite big (several gigabytes), and therefore reading the file takes too long.
Pentaho Wiki says:
The existing Get Data from XML step is easier to use but uses DOM
parsers that need in memory processing and even the purging of parts
of the file is not sufficient when these parts are very big.
The XML Input Stream (StAX) step uses a completely different approach
to solve use cases with very big and complex data structures and the
need for very fast data loads...
Therefore I'm now trying to do the same with StAX, but it doesn't work as planned. I'm testing this with an XML file that has only one element group. The file is read and then mapped/inserted into the table, but now I get multiple rows in the table where all the values are "undefined", plus some rows with the right values. In total I have 92 rows in the table, even though it should have only one.
Flow goes like:
1) read with StAX
2) Modified Java Script Value
3) Output to DB
At step 2) I'm doing the following:
var id;
if (xml_data_type_description.equals("CHARACTERS") &&
    xml_path.equals("/labels/label/id")) {
  id = xml_data_value;
}
...
I'm using positional-staz.zip from http://forums.pentaho.com/showthread.php?83480-XPath-in-Get-data-from-XML-tool&p=261230#post261230 as an example.
How can I use StAX to read an XML file and store the element values in a DB?
I've been trying to look for examples but haven't found much. The above example uses "Filter Rows" -component before inserting the rows. I don't quite understand why it's being used, can't I just map the values I need? It might be that this problem occurs because I don't use, or know how to use, Filter Rows -component.
Cheers!
I posted a possible StAX-based solution on the forum listed above, but I'll post the gist of it here since it is awaiting moderator approval.
Using the StAX parser, you can select just those elements that you care about, namely those with a data type of CHARACTERS. For the forum example, you basically need to denormalize the rows in sets of 4 (EXPR, EXCH, DATE, ASK). To do this you add the row number to the stream (using an Add Sequence step) then use a Calculator to determine a "bucket number" = INT((rownum-1)/4). This will give you a grouping field for a Row Denormaliser step.
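The bucket-number trick is easier to see outside Kettle. The following Python sketch mimics the three steps (Add Sequence, Calculator, Row Denormaliser) on a flat list of values. It is only an illustration: the `denormalise` function and the sample values are made up, and it assumes the CHARACTERS values always arrive in the fixed order EXPR, EXCH, DATE, ASK.

```python
FIELDS = ["EXPR", "EXCH", "DATE", "ASK"]


def denormalise(values, fields=FIELDS):
    """Group a flat stream of CHARACTERS values into one record per len(fields) values."""
    records = {}
    for rownum, value in enumerate(values, start=1):   # Add Sequence: 1, 2, 3, ...
        bucket = (rownum - 1) // len(fields)           # Calculator: INT((rownum-1)/4)
        # Assumption: the position inside the bucket tells us which field this is.
        name = fields[(rownum - 1) % len(fields)]
        records.setdefault(bucket, {})[name] = value   # Row Denormaliser: pivot on bucket
    return [records[bucket] for bucket in sorted(records)]


# Made-up sample stream: two groups of four values each.
stream = ["a", "X", "2012-01-01", "1.5", "b", "Y", "2012-01-02", "2.5"]

for record in denormalise(stream):
    print(record)
```

Rows 1 to 4 get bucket 0 and rows 5 to 8 get bucket 1, so each bucket collapses into a single output record. If elements can be missing or arrive out of order, the position-based field lookup no longer holds and the element path would have to be used as the key instead.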
When the post is approved, you'll see a link to a transformation that uses StAX and the method I describe above.
Is this what you're looking for? If not please let me know where I misunderstood and maybe I can help.

VB6 and data-bound MSHFlexGrid, moving and populating columns

I'm working on a VB6 program that connects to a SQL Server 2008 R2 database. In the past I have always used the MSFlexGrid control and populated it manually. Now, however, the guy who is paying me for this wants me to use data-bound grids instead, which forces me to use the MSHFlexGrid control because I'm using ADO and not DAO. So, I have two questions...
First, how would I move a column in a MSHFlexGrid? For example, if I wanted the third column to appear as the sixth column in the grid, is there a simple single line of code that would do that?
Second, believe it or not, I've never had to do anything in a grid other than display the data, as is, from a recordset. Now, however, I have a recordset with some fields that contain just ID numbers that refer to records in other files - for example, a field containing an ID number referring to a record in the Customers table, instead of the field containing the customer's name. What is the easiest way to, instead of having a column showing customer ID numbers from the recordset, having that column show customer names? I thought I read somewhere that there's a way to embed a sql command in a MSHFlexGrid column, but if there is I wouldn't know how to do it. Is this possible, or is there a simpler way to do it?
TIA,
Kevin
The column order would typically be handled by your SELECT statement.
Say you have a Pies table that has a FruitID foreign key related to the FruitID in a Fruits table:
SELECT PieID AS ID, Pie, Fruit
FROM Pies
LEFT OUTER JOIN Fruits ON Pies.FruitID = Fruits.FruitID
This returns three columns: ID, Pie, and Fruit, in that order.
Moving columns after the query/display operation is rarely used, but yes ColPosition can be used for that.
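To check how such a query shapes the recordset before binding it to the grid, here is a small self-contained sketch. It uses Python with an in-memory SQLite database purely for convenience (the question uses VB6, ADO and SQL Server), and the table contents are made up; only the Pies/Fruits schema and the SELECT follow the example above.

```python
import sqlite3

# In-memory SQLite stand-in for the real database; sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Fruits (FruitID INTEGER PRIMARY KEY, Fruit TEXT);
    CREATE TABLE Pies (PieID INTEGER PRIMARY KEY, Pie TEXT, FruitID INTEGER);
    INSERT INTO Fruits VALUES (1, 'Apple'), (2, 'Cherry');
    INSERT INTO Pies VALUES (10, 'Dutch apple', 1), (11, 'Cherry lattice', 2), (12, 'Mystery', NULL);
""")

# The SELECT list decides both the column order and which name replaces the ID.
cursor = conn.execute("""
    SELECT PieID AS ID, Pie, Fruit
    FROM Pies
    LEFT OUTER JOIN Fruits ON Pies.FruitID = Fruits.FruitID
    ORDER BY PieID
""")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(columns)
for row in rows:
    print(row)
```

The LEFT OUTER JOIN keeps pies whose FruitID has no match, so the 'Mystery' row still appears with an empty Fruit value. Reordering the names in the SELECT list is enough to change the order in which a bound grid would show the columns.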
Wow! VB6.... Back to the future! :-)
You can move Columns using the ColPosition Property.
This article shows how you could set up the grid to display hierarchical data.
If you just want to display the customer name on the same line as the main data then that is doable as well by just creating the proper SQL for your data source. For that matter you can control the column order the same way as well.
Now, how about considering upgrading to .Net? Just kidding..... No, I'm not. OK. I am, maybe. :-)
