Conditional Mapping in Talend - etl

I have created a simple job in Talend that performs an inner join on the data between two Excel sheets and then dumps the result into an output Excel sheet. This is best illustrated by the diagram below:
The mapping used in tMap is:
However, the additional challenge for me now is that I have to perform this mapping only if the column value in that row is not NULL. E.g. there is a mapping row1.RECID = row2.RECID, but this should only apply if row2.RECID is not NULL.
How do I achieve this in Talend? I have experimented a lot with tMap expressions but can't get it right.
Here is a small sample input and its corresponding expected output.
Suppose my input has the values:
v1, v2, v3, v4
1, A, O, 3
2, B, X, 4
3, C, X, 4
and the lookup has the values:
v1, v2, v3, v4
1, A, O, 3
2, null, X, 4
3, null, C, 4
2, null, X, null
Then the output should be:
v1, v2, v3, v4
1, A, O, 3
2, B, X, 4
2, B, X, 4
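The expected output above is consistent with one reading of the requirement: a null cell in the lookup places no constraint on the join, while every non-null lookup cell must equal the corresponding input cell. The following is a minimal Python sketch of that reading, not Talend code; the function names are hypothetical, and None stands in for a null cell.

```python
# Sample rows from the question; None stands for a null cell.
main_rows = [
    (1, "A", "O", 3),
    (2, "B", "X", 4),
    (3, "C", "X", 4),
]
lookup_rows = [
    (1, "A", "O", 3),
    (2, None, "X", 4),
    (3, None, "C", 4),
    (2, None, "X", None),
]


def matches(main_row, lookup_row):
    """A lookup cell constrains the join only when it is not null."""
    return all(
        lookup_value is None or lookup_value == main_value
        for main_value, lookup_value in zip(main_row, lookup_row)
    )


def conditional_inner_join(main, lookup):
    """Emit one copy of the main row for every lookup row it matches."""
    return [m for m in main for l in lookup if matches(m, l)]


result = conditional_inner_join(main_rows, lookup_rows)
for row in result:
    print(row)
```

Under this reading, input row 2 is emitted twice because it matches two lookup rows, and input row 3 is dropped because the lookup's non-null v3 value (C) differs from X.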

Before joining your input flows, you have to reject the rows with null values. I have created a mapping based on the given sample data.

Try to map as many values as possible from row1, then add row2 with a left outer join.
If you want only the values that are in both row1 and row2, you can add a filter on row2 for that (but I guess this is not what you want).

Talend has a more elegant option that lets you filter your data on multiple columns: the tSchemaComplianceCheck component, where filtering out nulls and empty values is as simple as ticking a couple of check boxes. It lets you use your own schema to check for nulls and empty values and filter them out. The error rows go to a reject flow, which you have the option of processing. If you do not wish to capture and process the rejects, you can simply ignore them. Your main flow will only contain the records that passed the compliance check. Here are some tips on using it:
In the tSchemaComplianceCheck component --> Basic Settings screen, click Custom Defined and it will show you each column. Make sure Nullable is unchecked, or else it will allow nulls to pass through.
In the Advanced Settings tab, check Treat all empty string as NULL. This works in conjunction with the prior step to filter out both nulls and empty values.
In your Excel component, click the Advanced Settings tab and check Stop reading on encountering empty rows.
Below is a screenshot which shows the basic flow and settings. You would link to a tMap instead of the tLogRow. If I have understood your problem correctly, I think you will find this is the ideal solution in Talend.
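The main/reject split described above can be sketched in Python as follows. This is only an illustration of the idea, not Talend code: the function names, the NAME column, the sample rows and the errorMessage field are assumptions made for the example; only RECID comes from the question.

```python
def is_missing(value, empty_as_null=True):
    """True for null cells and, optionally, for empty strings."""
    return value is None or (empty_as_null and value == "")


def compliance_check(rows, required_columns, empty_as_null=True):
    """Split rows into (main, rejects); rejects name the failing columns."""
    main, rejects = [], []
    for row in rows:
        failed = [c for c in required_columns if is_missing(row.get(c), empty_as_null)]
        if failed:
            rejects.append({**row, "errorMessage": "null or empty: " + ", ".join(failed)})
        else:
            main.append(row)
    return main, rejects


# Hypothetical lookup rows, invented for this sketch.
lookup = [
    {"RECID": 1, "NAME": "A"},
    {"RECID": None, "NAME": "B"},
    {"RECID": 3, "NAME": ""},
]
main, rejects = compliance_check(lookup, required_columns=["RECID", "NAME"])
print(main)
print(rejects)
```

Only the rows in main would continue to the join; the rejects can be logged or ignored, as in the answer.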

Related

Google Sheet Query: Select misses data when there are different data type in a column?

I have a table like this:
a  b  c
1  2  abc
2  3  4.00
Note that C2 is text while C3 is a number.
When I do
=QUERY(A1:C,"select *")
The result is like
a  b  c
1  2
2  3  4.00
The "text" in C2 has been missed. You can see the live sheet here:
https://docs.google.com/spreadsheets/d/1UOiP1JILUwgyYUsmy5RzQrpGj7opvPEXE46B3xfvHoQ/edit?usp=sharing
How to deal with this issue?
QUERY is very useful, but it has one main limitation: it can only handle one kind of data per column, and values of any other type are left blank. There are ways to try to work around this from inside QUERY, but I've found them unfruitful. What you can do instead is simply use:
={A:C}
You can work with FILTER on its own, but here is a step-by-step way to reproduce the main features of QUERY. If you need to add conditions, use LAMBDA, INDEX and FILTER.
For example, to keep the rows where A is not empty (INDEX(quer,,1) accesses the first column):
=LAMBDA(quer,FILTER(quer,INDEX(quer,,1)<>""))({A:C})
Where B is greater than the value in one cell (D1) and less than the value in another (D2):
=LAMBDA(quer,FILTER(quer,INDEX(quer,,2)>D1,INDEX(quer,,2)<D2))({A:C})
For sorting and limiting the number of items, use SORTN. For example, to sort by the 3rd column and keep the 5 highest values in that column:
=LAMBDA(quer,SORTN(FILTER(quer,INDEX(quer,,1)<>""),5,1,3,0))({A:C})
Or, to limit to 5 elements without sorting use ARRAY_CONSTRAIN:
=ARRAY_CONSTRAIN(LAMBDA(quer,FILTER(quer,INDEX(quer,,1)<>""))({A:C}),5)
There are other options: you can use REGEXMATCH and similar functions to emulate QUERY's features without losing data. Let me know!
shenkwen,
If you are comfortable with adding a Google Apps Script to your sheet to give you a custom function, I have a QUERY replacement function that supports all standard SQL SELECT syntax. It does not analyze the column data to force each column to its most common type, so this is not an issue.
The custom function code is a single file, available at:
https://github.com/demmings/gsSQL/tree/main/dist
After you save, you have a new function from your sheet. In your example, the syntax would be
=gsSQL("select a,b,c from testTable", {{"testTable", "F150:H152", 60, true}})
If your data is on a separate tab called 'testTable'(or whatever you want), the second parameter is not required.
I have typed in your example data into my test sheet (see line 150)
https://docs.google.com/spreadsheets/d/1Zmyk7a7u0xvICrxen-c0CdpssrLTkHwYx6XL00Tb1ws/edit?usp=sharing

SSIS: flagging ALL the Data Quality issues in each row with Conditional Split

I have been tasked with performing Data Quality checks on data from a SQL table, whereby I export problem rows into a separate SQL table.
So far I've used a main Conditional Split that goes into derived columns: 1 per conditional split condition. It is working whereby it checks for errors, and depending on which condition is failed first, the data is output with a DQ_TYPE column populated with a certain code (e.g. DQ_001 if it had an error with the Hours condition, DQ_002 if it hit an error with the Consultant Code condition, and so on).
The problem is that I need to be able to see all of the errors within each row. For example at the moment, if Patient 101 has a row in the SQL table that has errors in all 5 columns, it'll fail the first condition in Conditional Split and 1 row will get output into my results with the code DQ_001. I would instead need it to be output 5 times, once for each error that it encountered, i.e. 1 row with DQ_001, a 2nd row with DQ_002, a 3rd row with DQ_003 and so on.
The goal is to use the DataQualityErrors SQL table to create an SSRS report that groups on DQ_TYPE, so that we can use a pie chart to show which DQ_00X error codes are most prevalent.
Is this possible using straightforward toolbox functions? Or is this only available with complex Script tasks, etc.?
Assuming I understand your problem, I would structure this as a series of columns added to the data flow via Derived Column transformation.
Assume I have inbound like this
SELECT col1, col2, col3, col4;
My business rules
col1 cannot contain nulls DQ_001
col2 must be greater than 5 DQ_002
col3 must be less than 3 DQ_003
col4 has no rules
From my source, I would add a Derived Column Component
New Column named Pass_DQ_001 as a boolean with an expression !isnull([col1])
New Column named Pass_DQ_002 as a boolean with an expression [col2] > 5
New Column named Pass_DQ_003 as a boolean with an expression [col3] < 3
etc
At this point, your data row could look something like
NULL, 4, 4, No Rules, False, False, False
ABC, 7, 2, Still No Rules, True, True, True
...
If you have more than 3 to 4 data quality conditions, I'd add a final Derived Column component into the mix
New column IsValid as yet another boolean with an expression like Pass_DQ_001 && Pass_DQ_002 && Pass_DQ_003 etc
The penalty for adding extra columns is trivial compared to trying to debug complex expressions in a data flow, so don't cram the logic into one big expression. The cost is especially small for bit columns.
At this point, you can put a data viewer in there and verify that yes, all my logic is correct. If it's wrong, you can zip in and figure out why DQ_036 isn't flagging correctly.
Otherwise, you're ready to then connect the data flow to a Conditional Split. Use our final column IsValid and things that match that go out the Output 1 path and the default/unmatched rows head to your "needs attention/failed validation" destination.
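The flag-per-rule approach can be sketched in Python using the three example rules and the two sample rows above. This is an illustration rather than SSIS code: the function names are hypothetical, and the last step (emitting one DQ_TYPE row per failed rule, which is what the question asks for) is an assumption about how the flags could then be used, not something the answer spells out.

```python
# One predicate per business rule; True means the row passes that rule.
RULES = {
    "DQ_001": lambda row: row["col1"] is not None,
    "DQ_002": lambda row: row["col2"] is not None and row["col2"] > 5,
    "DQ_003": lambda row: row["col3"] is not None and row["col3"] < 3,
}


def add_pass_flags(row):
    """Mimic the Derived Column steps: one Pass_* flag per rule, plus IsValid."""
    flagged = dict(row)
    for code, rule in RULES.items():
        flagged["Pass_" + code] = rule(row)
    flagged["IsValid"] = all(flagged["Pass_" + code] for code in RULES)
    return flagged


def failed_rule_rows(row, flagged):
    """Emit one copy of the source row per failed rule, tagged with DQ_TYPE."""
    return [{**row, "DQ_TYPE": code} for code in RULES if not flagged["Pass_" + code]]


rows = [
    {"col1": None, "col2": 4, "col3": 4, "col4": "No Rules"},
    {"col1": "ABC", "col2": 7, "col3": 2, "col4": "Still No Rules"},
]
flagged_rows = [add_pass_flags(r) for r in rows]
errors = [e for r, f in zip(rows, flagged_rows) for e in failed_rule_rows(r, f)]
for e in errors:
    print(e)
```

Because every rule is evaluated independently, a row that breaks all three rules produces three error rows, which is the shape a report grouped on DQ_TYPE needs.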

How to skip columns in “List from a range” Criteria?

Is it possible to create a "List from a range" Data Validation rule in Google Sheets where the range skips columns?
For example:
Cells A6:A11 is limited to the range A1:B3. Cells B6:B11 is limited to the range A1:A3 AND C1:C3 (skips column B).
Creating a Data Validation rule for cells A6:A11 is trivial as I simply need to create a Criteria of "List from a range = A1:B3".
However, creating the Data Validation rule for cells B6:B11 is not so intuitive since Google Sheets does not allow me to create a Criteria using the syntax "List from a range = A1:A3, C1:C3".
Does the "List from a range" Criteria support a syntax that allows us to skip columns within a range?
Note: I currently have a workaround for this, where I define an array formula in D1, =ArrayFormula(if({1,""},A1:A3,C1:C3)), and then use D1:E3 as the Data Validation range. But this is a hacky solution and I'm hoping there is a better way to accomplish my goal.
The solution is to use { } to create a combination of columns or rows that will result in some sort of virtual table on-the-fly.
Example:
Assuming you have a spreadsheet with Name, Age, Gender, Phone and Address in A, B, C, D and E, and you want to skip the Gender (column C) while using the UNIQUE statement, you can use something like this.
Put in G1 the following formula:
=UNIQUE({A1:B, D1:E})
From the cell G1, the spreadsheet will populate the columns G, H, I and J with unique combinations of A, B, D and E, excluding the column C (Gender).
The same application of a combined range can be used in any formula and also you can combine multiple different ranges, including cross Spreadsheets and Files.
It is a very useful trick if you need to combine pieces of multiple spreadsheets for data visualization or reports. However, always remember you cannot manipulate the displayed data. You can still search through it, format it, etc., but you cannot change it. On the other hand, it will auto-update always if the data source gets updated, which is very useful.
Note: Try it with LOOKUP, VLOOKUP or HLOOKUP.
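For readers who think in code rather than formulas, the effect of =UNIQUE({A1:B, D1:E}) can be mimicked in Python as below. This is only an analogy of the spreadsheet behaviour; the function names and the sample rows are invented for the illustration.

```python
def combine_columns(rows, keep):
    """Build a virtual table from the chosen column positions only."""
    return [tuple(row[i] for i in keep) for row in rows]


def unique_rows(rows):
    """Drop repeated rows while keeping first-seen order."""
    seen, result = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


# Hypothetical sheet rows: Name, Age, Gender, Phone, Address.
sheet = [
    ("Ann", 30, "F", "555-0100", "1 Main St"),
    ("Bob", 41, "M", "555-0101", "2 Oak Ave"),
    ("Ann", 30, "X", "555-0100", "1 Main St"),
]
# Columns A, B, D, E are positions 0, 1, 3, 4; column C (Gender) is skipped.
virtual = unique_rows(combine_columns(sheet, keep=[0, 1, 3, 4]))
print(virtual)
```

The third row collapses into the first because the two differ only in the skipped column.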

Tableau: Set a filter with one value always selected and let user to choose others?

I want to create a table with a filter that lets users select and compare things:
Say I have a variable Var, containing the values A, B, C, D, E. I want a filter that lets a user select one of A, B, C, D while E is always selected, so that the selected value and the fixed E can be displayed in one single table.
What is the best approach to achieve this (I checked other posts and they don't seem to work)?
One easy solution if the choices for your variable are relatively stable is as follows:
Create a parameter called say, Chosen_Var, containing the values that you want the user to choose between (i.e. A, B, C, D). Parameters can hold a single value.
Create a calculated field called say, Var_Desired, to distinguish whether an individual data row meets your filter criteria [Var] = "E" or [Var] = [Chosen_Var]
Place that field, [Var_Desired], on the filter shelf and select only the True value
Show your parameter control and configure as desired
This will allow users to select one of the values A, B, C or D, and then filter to only include data where [Var] = the fixed value E or the value the user selected.
If the set of legal values changes frequently, so that using a static list of parameter values is difficult, or if you want the user to be able to select multiple values, you'll need another solution.
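The logic of the Var_Desired calculated field can be sketched in Python like this. It is an illustration of the boolean test only, not Tableau syntax; the sample rows and the function name are made up for the example.

```python
FIXED_VALUE = "E"  # the value that must always be shown


def var_desired(var, chosen_var):
    """Mirror of the calculated field: [Var] = "E" or [Var] = [Chosen_Var]."""
    return var == FIXED_VALUE or var == chosen_var


# Hypothetical data rows: (Var, measure).
data = [("A", 10), ("B", 20), ("C", 30), ("D", 40), ("E", 50)]

chosen_var = "B"  # stands in for the single value held by the parameter
visible = [row for row in data if var_desired(row[0], chosen_var)]
print(visible)
```

Changing chosen_var swaps which of A to D appears, while the E row stays in the result every time.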

kibana 4 discover table in dashboard [duplicate]

I'm testing Kibana 4 for a project.
I have created an index from my database table which is composed by 3 fields:
Date
User
Action
I would like to display my index as a simple table (3 column, N rows) in my dashboard.
I tried to use "Data table" visualization but I can't find a way to display my results without any Metrics (Count, Sum etc...)
Maybe it's pretty simple and I missed something... is there a way to do this?
Regards,
On the Discover tab, create a view that has just the fields you want and then save that as a search.
On the Dashboard tab, click on Edit then hit the + Create new button to add a widget, but if you look at the top, there's a Searches tab. Select that and add your saved search in.
[Elastic 7.x / 2019 Update]
I was a bit confused when I read @Alcanzar's answer, so I am sharing a slightly more beginner-friendly step-by-step how-to here:
STEP 1 : Create the Index Pattern
STEP 2 : Go to the Discover view, and create a view on your index
Select each column you want to include/add in your view by clicking "add" on it (The confusing part is that until you do that, you will have a "scrambled" view listing everything in a jumbled way.)
STEP 3 : Go to the Dashboard view, and create a view on your index
The trick is to select the specific columns you want to include... and voila !
Don't forget to save your view, this will help a lot in the process.
In Kibana 7.5.0 you can do it as follows:
Go to Discover section
Select fields you are interested in
Click on Save to save your discover search so you can use it in visualizations and dashboards
Click on Dashboard and create a new dashboard
Click on Add and select the panel
There is no step 6
The accepted solution has its pros (if, for simplicity, you see your index as a table, this is the only way to deal with rows naturally) but also cons (it allows the user to see too much information, by expanding the records that appear in the table; users cannot get an export of the values).
So if you plan to build tables for reports seen by users who should not see everything and who may want to export the data, I recommend a different (hacky) approach using Table visualizations:
Say you have three columns A, B and C:
If there are no duplicates considering the combined values of A and B, you can use these two values as aggregation fields, and then set a Max or Top hit Metric for C.
If even A, B and C together have duplicates, then you can use the three of them as aggregation fields and add a Count Metric, which will give you the number of repeated rows. This solution makes some sense, because instead of repeating the same row 'n' times, the table just tells you that the row occurs 'n' times.
If A and B have duplicates but A, B and C are unique, then there is, afaik, no elegant solution. You have to use the three of them as aggregation fields, but then you would have a dummy metric at the end (e.g. count, always equal to 1).
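The second and third cases above amount to grouping on all three columns and counting. A minimal Python sketch of what such a table ends up showing follows; the sample rows are invented and this is not Kibana or Elasticsearch code.

```python
from collections import Counter

# Hypothetical index rows: (A, B, C).
rows = [
    ("2019-01-01", "alice", "login"),
    ("2019-01-01", "alice", "login"),
    ("2019-01-01", "alice", "logout"),
    ("2019-01-02", "bob", "login"),
]

# Bucket on A, B and C together, with Count as the metric.
counts = Counter(rows)
table = [(*key, n) for key, n in counts.items()]
for line in table:
    print(line)
```

When A, B and C together are unique, every count is 1, which is the dummy metric mentioned in the last case.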
Why do we have to go through all of this? That is another question...
