Pentaho: merging a small list of cities with a big list of people using the same key field

I'm trying to merge two lists. For both lists I added a sequence that counts up to a maximum of 18, because I have a list of 18 cities.
This is my transformation:
Basically, I added the city_ID sequence, which counts up to a maximum of 18, because my text file has an "ID" field with a maximum value of 18.
The idea is that "Merge Join 2" would match every row with the same ID, repeating the city names for the rows from "CSV file input 2", so that I don't have to generate the city names by hand.
This is the result when I merge on "Merge Join 2":
I tried the merge with a full outer join, and I made sure I had the right ID. What I want is for the city names to repeat down the rows when merging on "Merge Join 2".
This is my city list:
This is my people list after adding the city_ID sequence:

The Merge join step needs its input rows ordered by the key you are joining on, so you need a Sort step before each branch of the Merge join.
I assume that in "Merge join" you are joining by ID_Client. Since you add ID_Client to both files using a sequence, and the sequence delivers the rows already ordered, a sort is not necessary there. In "Merge join 2", however, you are joining by ID for the rows coming from "Text file input" and by City_ID for the rows coming from "Merge join". So you need one Sort step after "Text file input" to sort by ID (not strictly needed in this particular case, but it is good practice in case your file isn't ordered) and another Sort step after "Merge join" to order the rows by City_ID.
Always add those Sort steps before a Merge join (unless you have already sorted earlier and have not changed the order in the steps since), even if you think you don't need them because the files come ordered or you are querying a table with an ORDER BY. Pentaho uses Java internally, so with keys based on character or date columns, the order defined by your language settings might not be the same order Java uses internally, because of special characters.
Also beware that the Sort step by default treats upper and lower case as the same character ("A" the same as "a") unless you explicitly tick the option to treat them as different. The Merge join step treats them as different, so you will get mismatches because of that annoying default in Pentaho.
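To see why the sort matters, here is a minimal Python sketch of a sort-merge join. It is not Pentaho's implementation, only an illustration of the general technique; the city and person names are made up, and only the ID and City_ID field names come from the question. The people rows carry a City_ID that cycles 1, 2, 3, 1, 2, 3, so they are not ordered by City_ID.

```python
def merge_join(left, right, left_key, right_key):
    """Inner sort-merge join. Assumes BOTH inputs are already sorted by their key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][left_key], right[j][right_key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit every consecutive right row that shares this key.
            k = j
            while k < len(right) and right[k][right_key] == lk:
                out.append({**left[i], **right[k]})
                k += 1
            i += 1
    return out


# Hypothetical sample data (three cities instead of 18).
cities = [
    {"ID": 1, "City": "Lisbon"},
    {"ID": 2, "City": "Porto"},
    {"ID": 3, "City": "Faro"},
]
# A cycling sequence gives City_ID = 1, 2, 3, 1, 2, 3: not sorted by City_ID.
people = [{"Name": f"Person{n}", "City_ID": (n % 3) + 1} for n in range(6)]

unsorted_result = merge_join(cities, people, "ID", "City_ID")
sorted_result = merge_join(
    cities, sorted(people, key=lambda r: r["City_ID"]), "ID", "City_ID"
)

print(len(unsorted_result), "rows without sorting the people branch")
print(len(sorted_result), "rows after sorting the people branch by City_ID")
```

Without the sort, the join walks past the later people rows and silently drops them; with the sort, every person gets a city name.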

Related

How to make groups in an input and select a specific row in each of them in Talend?

I am working on a Talend transformation process (we are using Talend 6.4), and I don't know how to implement the current requirement.
I have an input consisting of:
Two columns that are my group keys (Account and Product), but are not unique (the same Account x Product couple can happen in multiple rows)
A criterion column (Contract end date), which will help me decide which row I want to keep for each group
Some "tail" data that need to be passed to the following step of the processing (the contract number)
The rule to implement is:
Keep only one record per group
The selected record must be one with no end date or, if all have end date, with the biggest end date
The selected record can be random in case there is a tie
See the transformation applying those rules on some dummy data:
I thought first to do the following:
sort by Account, Product, End_date (nulls first)
"select first" in each group
but I am not skilled enough to know whether the second transformation exists in Talend.
Regards,
Pierre
Very interesting Talend question.
You need to create something like this job.
Here is a link to the zip file to import into your Talend.
The answer from #MBDIA seems to be working; however, I would like to share what we did to fulfill our requirement.
See our Talend process here:
The first tMap (tMap_3) acts like a tReplicate and a tMap, and sends:
in the upper branch only the Account and Product references, that are then deduplicated by the tAggregateRow_1.
in the lower branch all data and computed fields that enable us to take care of the case where the date is missing (instead of defaulting to 31/12/9999, we compute a flag (0 or 1) that we use in the sort step afterwards).
In the second part of the process, we first apply the sort to the whole data on Account, Product, Empty date flag (computed before), End date (desc) and use a second tMap to make a join on both branches (on Account x Product), only keeping First Match in order to keep the first record as per our requirement.
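The rule itself (sort with missing end dates first, then latest end date, and keep the first row per group) can be sketched outside Talend. The following Python is only an illustration of that logic under assumed column names and made-up rows; it is not a Talend job.

```python
from datetime import date
from itertools import groupby

# Hypothetical sample rows; the column names are assumptions for illustration.
rows = [
    {"account": "A1", "product": "P1", "end_date": date(2020, 1, 31), "contract": "C1"},
    {"account": "A1", "product": "P1", "end_date": None, "contract": "C2"},
    {"account": "A1", "product": "P2", "end_date": date(2019, 6, 30), "contract": "C3"},
    {"account": "A1", "product": "P2", "end_date": date(2021, 6, 30), "contract": "C4"},
]


def sort_key(row):
    # Flag is 0 when the end date is missing, so those rows sort first.
    has_end_date = 0 if row["end_date"] is None else 1
    # Negated ordinal gives a descending order on the end date.
    neg_ordinal = 0 if row["end_date"] is None else -row["end_date"].toordinal()
    return (row["account"], row["product"], has_end_date, neg_ordinal)


def group_key(row):
    return (row["account"], row["product"])


ordered = sorted(rows, key=sort_key)
# groupby needs its input sorted by the group key, which sort_key guarantees.
selected = [next(group) for _, group in groupby(ordered, key=group_key)]

for row in selected:
    print(row["account"], row["product"], row["contract"])
```

The computed flag plays the same role as the "empty date flag" described above: it avoids substituting a sentinel date such as 31/12/9999.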

How To Sort Specific Column Data in SSRS Tablix

I am using Visual Studio 2012 for SSRS and my queries come from using Microsoft SQL Server 2012.
My question below pertains to SSRS and sorting.
In my Tablix, I currently have the Row Groups set up as Group -> Manager -> Owner -> Status Description, and the data pulls in fine from the dataset. In the tablix, basically everything is a drilldown: each of the Row Groups (except Group) is hidden initially and can be toggled by the report item ahead of it. In the Status Description part, the records come in as Active, Completed, In Process..., which is fine because that is alphabetical order.
But I want that specific column to show as Active, In Process, Completed..., following the way a file would go through the process. These are only 3 of the possible values; there are more, but these are the most common. How do I sort that individual column in the order I mentioned above, or otherwise customize the sort order?
You can use expressions in the sort order to get the sorting exactly as you'd like.
What you can do is use the SWITCH function to number your output in the order you want. For example, I have 3 statuses: "Complete", "On Hold" and "Work at Risk". Normally they would sort either ascending or descending, but if I enter this as the sort expression I can change that:
=SWITCH(
Fields!PRS_STATUS.Value = "Complete", 1,
Fields!PRS_STATUS.Value = "Work at Risk", 2,
Fields!PRS_STATUS.Value = "On Hold", 3)
And now it orders 1, 2, 3, i.e. Complete, Work at Risk, On Hold.
You can put this switch in a larger switch statement to have multiple sort orders depending on a criteria or parameter.
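The same idea, mapping each status to a rank and sorting on the rank, can be sketched in Python. This is only an illustration of the logic, not SSRS syntax. Placing unlisted statuses last, alphabetically, is an assumption of this sketch.

```python
# Desired process order from the question; the ranks are arbitrary integers.
STATUS_ORDER = {"Active": 1, "In Process": 2, "Completed": 3}


def status_rank(status):
    # Assumption: statuses not listed above sort after the known ones.
    return STATUS_ORDER.get(status, len(STATUS_ORDER) + 1)


statuses = ["Completed", "Active", "On Hold", "In Process"]
ordered = sorted(statuses, key=lambda s: (status_rank(s), s))
print(ordered)
```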

Access 2013: Check if a value is present in another table

I've just discovered Access, having always been an Excel/VBA man... and now I've hit a roadblock!
I'm building an inventory database for my employer. I have 2 tables: one containing a single column of 'stockID's (let's call this table 'tblWarehouse'), and another containing two columns, 'orderID' and 'stockID' (let's call this table 'tblOrders'). (For the sake of this question, let's disregard things like quantity, price, etc.)
We don't keep all the goods we sell in our own warehouse, some are sourced directly from the manufacturer to the customer, which means that not all tblOrders!stockID will be present in the list tblWarehouse!stockID. I need to find out when this is the case!
I want to create a third column in tblOrders containing a dummy variable = 1 if that particular item is in our warehouse. In other words, I want to create a calculated column = 1 if tblOrders!stockID can be found in tblWarehouse!stockID. Can this be done?
I've found that I can't reference another table directly, so I've been trying my hand at queries, user-defined functions and relationships, but to no avail. I've also been having trouble with the Access lingo and the veritable forest of different places to input seemingly the same expressions... so please, if you have an answer for me, be sure to specify where things are located!
Much obliged!!
If you link the two tables in a query using an inner join, only order records having at least one stock entry will be included in the result. To also include those with no stock entry at all, create a left outer join.
SELECT O.OrderID, IIf(IsNull(MAX(W.StockID)), 0, 1) AS StockAvailable
FROM tblOrders O
LEFT JOIN tblWarehouse W
ON O.StockID = W.StockID
GROUP BY O.OrderID
You can also determine the join type in the query designer by right-clicking a relation line, selecting "Join Properties" and then selecting "Include ALL records from tblOrders ...". You can make a grouping query by clicking the big Sigma symbol in the symbol list.
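The left-join approach can be tried outside Access. The sketch below uses Python's built-in sqlite3 module with made-up stock IDs. SQLite is not Access, so the flag is written with CASE rather than IIf, and the query here groups per order line (orderID and stockID), which is an assumption about the desired output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tblWarehouse (stockID TEXT);
    CREATE TABLE tblOrders (orderID INTEGER, stockID TEXT);
    -- Hypothetical sample data
    INSERT INTO tblWarehouse VALUES ('S1'), ('S2');
    INSERT INTO tblOrders VALUES (1, 'S1'), (2, 'S9'), (3, 'S2');
    """
)

rows = conn.execute(
    """
    SELECT O.orderID,
           O.stockID,
           CASE WHEN MAX(W.stockID) IS NULL THEN 0 ELSE 1 END AS StockAvailable
    FROM tblOrders AS O
    LEFT JOIN tblWarehouse AS W ON O.stockID = W.stockID
    GROUP BY O.orderID, O.stockID
    ORDER BY O.orderID
    """
).fetchall()

for row in rows:
    print(row)
```

Order 2 references a stock ID that is not in tblWarehouse, so the left join leaves the warehouse side NULL and the flag comes out as 0.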

How to sort a Reporting Services table by an auto generated column

I have this table:
When executed, it looks like:
This table is sorted in alphabetical order. I would like to sort it by the column named "No Vencido", which is generated at runtime by combining 2 dimensions of a cube (one dimension is called "Class 1", the other is called "value").
How can I sort a table by an autogenerated field?
Thanks
You can sort by any sort of expression - SSRS will quite happily sort something like two fields concatenated together:
=Fields!Class1.Value & Fields!value.Value
Just be careful to apply the sorting at the appropriate level to avoid unexpected results, i.e. make sure you don't have different sorting expressions in any row group or detail group unless they are required.
If No Vencido is the grouping expression, apply the sorting at the group level.
If you don't want to sort on an expression, you can create a calculated field for each row in the dataset with the expression =Fields!Class1.Value & Fields!value.Value and group/sort on that calculated field as required.
Edit after comment
OK, I think you need to apply a sort expression like this to the groups that apply to the Top and Otros rows:
=Sum(IIf(Fields!Clase_1.Value = "No Vencido", Fields!Monto.Value, Nothing))
This is still sorting by the total Monto for each row group, but only considering the rows where Clase_1 is No Vencido.
Once this is set up sort by A-Z or Z-A as required.
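As a rough illustration of what that sort expression computes, the Python sketch below sums Monto per row group, counting only rows where Clase_1 is "No Vencido", and then orders the groups by that total. The group names and amounts are made up, and treating a group with no matching rows as 0 is an assumption of this sketch (in SSRS the expression would yield Nothing for such a group).

```python
from collections import defaultdict

# Hypothetical detail rows; "group" stands in for the row group of the table.
rows = [
    {"group": "Cliente A", "Clase_1": "No Vencido", "Monto": 100.0},
    {"group": "Cliente A", "Clase_1": "Vencido", "Monto": 900.0},
    {"group": "Cliente B", "Clase_1": "No Vencido", "Monto": 300.0},
    {"group": "Cliente C", "Clase_1": "Vencido", "Monto": 500.0},
]

# Conditional sum per group: only "No Vencido" rows contribute.
totals = defaultdict(float)
for row in rows:
    totals[row["group"]] += row["Monto"] if row["Clase_1"] == "No Vencido" else 0.0

# Largest "No Vencido" total first (the Z-A direction).
ordered = sorted(totals, key=totals.get, reverse=True)
print(ordered)
```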

Sort in SSIS takes a long time; if I sort at the OLE DB source instead, the Conditional Split doesn't work

Here is my problem. I use two sources:
First query: select * from servera.databasea.tablea
Second query: select id, modifiedon from serverb.databaseb.tableb
I sort the first query and the second query, then combine them with a Merge Join configured as a left join.
The Conditional Split is:
isnull(idtableb): insert (into server B)
!isnull(idtableb) && modifiedontableb<modifiedontablea: update (on server B)
It works fine with a few rows, but I work with more than 50,000 rows; the Sort takes more than 2 hours and then fails with an error.
My other approach was to mark the data as sorted at the OLE DB Source: right-click it, choose Show Advanced Editor, and under Input and Output Properties, on the OLE DB Source Output, change IsSorted to True. Then, under Output Columns, I set SortKeyPosition to 1 for id (I did not set anything for modifiedon).
I did these steps for both OLE DB sources (one for each server).
It runs a lot faster, finishing in 5 minutes, but it always inserts.
The Conditional Split doesn't work now, because every row goes to the insert output.
I added a (DT_DBDATE) cast in the Conditional Split and nothing changed (only inserts, never updates). Then I changed the modifiedon cast to (DT_DATE), and still nothing changed.
So my question is: without using the Sort component, how can I make the Conditional Split work?
The Sort step takes long because you are running low on memory for the sort operation. That means it starts sorting on disk, which is horribly slow. One option is to use a third-party sorting component such as NSort.
Otherwise you can do the following:
For the Merge Join to work, your inputs need to be sorted, both in the query itself and as declared through SortKeyPosition. They also need to be sorted the same way.
Your queries should read:
SELECT * FROM servera.databasea.tablea ORDER BY id, modifiedon
SELECT id, modifiedon FROM serverb.databaseb.tableb ORDER BY id, modifiedon
Now set IsSorted to True and set SortKeyPosition to 1 for id.
In your Merge Join step, use id as the join key.
Now, in your Conditional Split, you can use your two output cases.
Please note: if you have MULTIPLE rows per id, you need something more to sort and join on, so that rows don't end up matched in the wrong order.
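To illustrate why declaring IsSorted without an actual ORDER BY sends rows to the insert branch, here is a small Python sketch of a left merge join followed by the split logic. It is not SSIS code; the rows are made up, and it assumes at most one row per id on each side.

```python
from datetime import datetime


def left_merge_split(source_rows, target_rows):
    """Left merge join on id, then route each source row.

    Assumes both inputs are sorted by id with at most one row per id.
    Returns (inserts, updates, unchanged) as lists of ids.
    """
    inserts, updates, unchanged = [], [], []
    j = 0
    for src in source_rows:
        # Advance the target cursor past smaller ids; it never moves backwards.
        while j < len(target_rows) and target_rows[j]["id"] < src["id"]:
            j += 1
        found = j < len(target_rows) and target_rows[j]["id"] == src["id"]
        if not found:
            inserts.append(src["id"])       # isnull(idtableb)
        elif target_rows[j]["modifiedon"] < src["modifiedon"]:
            updates.append(src["id"])       # target is older than source
        else:
            unchanged.append(src["id"])
    return inserts, updates, unchanged


# Hypothetical sample data.
source = [
    {"id": 1, "modifiedon": datetime(2020, 1, 2)},
    {"id": 2, "modifiedon": datetime(2020, 1, 1)},
    {"id": 3, "modifiedon": datetime(2020, 1, 1)},
]
target = [
    {"id": 1, "modifiedon": datetime(2020, 1, 1)},
    {"id": 2, "modifiedon": datetime(2020, 1, 1)},
]

good = left_merge_split(source, target)
# Same rows, but the target arrives in a different order than declared.
bad = left_merge_split(source, list(reversed(target)))
print("sorted target:", good)
print("unsorted target:", bad)
```

With the target out of order, id 1 is never found and is wrongly routed to insert, which matches the symptom described in the question.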
