SSIS: flagging ALL the Data Quality issues in each row with Conditional Split

I have been tasked with performing Data Quality checks on data from a SQL table, whereby I export problem rows into a separate SQL table.
So far I've used a main Conditional Split that goes into derived columns: 1 per conditional split condition. It is working whereby it checks for errors, and depending on which condition is failed first, the data is output with a DQ_TYPE column populated with a certain code (e.g. DQ_001 if it had an error with the Hours condition, DQ_002 if it hit an error with the Consultant Code condition, and so on).
The problem is that I need to be able to see all of the errors within each row. For example at the moment, if Patient 101 has a row in the SQL table that has errors in all 5 columns, it'll fail the first condition in Conditional Split and 1 row will get output into my results with the code DQ_001. I would instead need it to be output 5 times, once for each error that it encountered, i.e. 1 row with DQ_001, a 2nd row with DQ_002, a 3rd row with DQ_003 and so on.
The goal is to use the DataQualityErrors SQL table to create an SSRS report that groups on DQ_TYPE, so that we can use a pie chart to show which DQ_00X error codes are most prevalent.
Is this possible using straightforward toolbox functions? Or is this only available with complex Script tasks, etc.?

Assuming I understand your problem, I would structure this as a series of columns added to the data flow via Derived Column transformation.
Assume I have inbound data like this:
SELECT col1, col2, col3, col4;
My business rules are:
col1 cannot contain nulls (DQ_001)
col2 must be greater than 5 (DQ_002)
col3 must be less than 3 (DQ_003)
col4 has no rules
From my source, I would add a Derived Column component with:
New column named Pass_DQ_001 as a boolean with the expression !isnull([col1])
New column named Pass_DQ_002 as a boolean with the expression [col2] > 5
New column named Pass_DQ_003 as a boolean with the expression [col3] < 3
etc.
At this point, your data row could look something like
NULL, 4, 4, No Rules, False, False, False
ABC, 7, 2, Still No Rules, True, True, True
...
If you have more than 3 to 4 data quality conditions, I'd add a final Derived Column component into the mix
New column IsValid as yet another boolean with an expression like Pass_DQ_001 && Pass_DQ_002 && Pass_DQ_003 etc
The penalty for adding extra columns is trivial compared to trying to debug one complex expression in a data flow, so don't cram all the logic into a single expression, especially when the extra columns are only bit columns.
At this point, you can put a data viewer in there and verify that yes, all my logic is correct. If it's wrong, you can zip in and figure out why DQ_036 isn't flagging correctly.
Otherwise, you're ready to connect the data flow to a Conditional Split. Use the final column IsValid as the condition: rows that match go out the Output 1 path, and the default/unmatched rows head to your "needs attention/failed validation" destination.
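The flag-per-rule idea above can be sketched outside SSIS. The following is a minimal Python model, not SSIS code: the RULES table, the helper names add_flags and failed_rows, and the null handling in the comparisons are assumptions made for illustration. It also shows how the Pass_DQ_00X flags could be fanned out into one output row per failed rule, which is what the question asks for.

```python
# Hypothetical rules mirroring the example: DQ code -> test that returns True on pass.
RULES = {
    "DQ_001": lambda row: row["col1"] is not None,
    "DQ_002": lambda row: row["col2"] is not None and row["col2"] > 5,
    "DQ_003": lambda row: row["col3"] is not None and row["col3"] < 3,
}


def add_flags(row):
    """Add one Pass_<code> boolean per rule, plus a combined IsValid flag."""
    flagged = dict(row)
    for code, passes in RULES.items():
        flagged["Pass_" + code] = passes(row)
    flagged["IsValid"] = all(flagged["Pass_" + code] for code in RULES)
    return flagged


def failed_rows(flagged):
    """Return one copy of the source columns per failed rule, tagged with DQ_TYPE."""
    source = {k: v for k, v in flagged.items()
              if not k.startswith("Pass_") and k != "IsValid"}
    return [dict(source, DQ_TYPE=code)
            for code in RULES if not flagged["Pass_" + code]]


# The two sample rows from the answer.
rows = [
    {"col1": None, "col2": 4, "col3": 4, "col4": "No Rules"},
    {"col1": "ABC", "col2": 7, "col3": 2, "col4": "Still No Rules"},
]
flagged = [add_flags(r) for r in rows]
errors = [e for f in flagged for e in failed_rows(f)]
for e in errors:
    print(e)
```

The first sample row fails all three rules and so produces three error rows, while the second produces none. Grouping the error rows on DQ_TYPE then gives the distribution the report needs.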

Related

Google Sheet Query: Select misses data when there are different data type in a column?

I have a table like this:
a    b    c
1    2    abc
2    3    4.00
Note that C2 is text while C3 is a number.
When I do
=QUERY(A1:C,"select *")
The result is like
a    b    c
1    2
2    3    4.00
The "text" in C2 has been missed. You can see the live sheet here:
https://docs.google.com/spreadsheets/d/1UOiP1JILUwgyYUsmy5RzQrpGj7opvPEXE46B3xfvHoQ/edit?usp=sharing
How to deal with this issue?
QUERY is very useful, but it has one main limitation: it can handle only one data type per column, and values of any other type are returned as blanks. There are ways to try to work around this from inside QUERY, but I've found them unfruitful. What you can do instead is simply use:
={A:C}
FILTER can do the work on its own, but here is a step-by-step way to reproduce the main features of QUERY. If you need to add conditions, combine LAMBDA, INDEX and FILTER.
For example, to check where A is not empty:
=LAMBDA(quer,FILTER(quer,INDEX(quer,,1)<>""))({A:C})
Here INDEX(quer,,1) accesses the first column.
Where B is greater than the value in one cell (D1) and less than the value in another (D2):
=LAMBDA(quer,FILTER(quer,INDEX(quer,,2)>D1,INDEX(quer,,2)<D2))({A:C})
For sorting and limiting the number of rows, use SORTN. For example, to sort by the 3rd column in descending order and keep the 5 highest values in that column:
=LAMBDA(quer,SORTN(FILTER(quer,INDEX(quer,,1)<>""),5,1,3,0))({A:C})
Or, to limit to 5 rows without sorting, use ARRAY_CONSTRAIN, which needs both a row count and a column count (3 here, for columns A to C):
=ARRAY_CONSTRAIN(LAMBDA(quer,FILTER(quer,INDEX(quer,,1)<>""))({A:C}),5,3)
There are other options too: you can use REGEXMATCH and similar functions to emulate QUERY's features without losing data. Let me know!
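The pattern behind these formulas is plain row filtering on a column test, with no attempt to force a column to one type. A rough Python sketch of that logic follows. The helper names filter_rows, top_n and first_n are made up for illustration, and this models the idea only, not how Google Sheets evaluates the formulas.

```python
# Sample data from the question, plus an empty trailing row as in an open-ended A:C range.
rows = [
    [1, 2, "abc"],
    [2, 3, 4.0],
    ["", "", ""],
]


def filter_rows(table, column, keep):
    """Keep rows whose value in the 1-based column passes the test (like FILTER + INDEX)."""
    return [row for row in table if keep(row[column - 1])]


def top_n(table, column, n):
    """Sort descending on the 1-based column and keep the first n rows."""
    return sorted(table, key=lambda row: row[column - 1], reverse=True)[:n]


def first_n(table, n):
    """Keep the first n rows without sorting."""
    return table[:n]


non_empty = filter_rows(rows, column=1, keep=lambda v: v != "")
print(non_empty)                # the text "abc" survives alongside the number
print(top_n(non_empty, 2, 5))   # sorted on column 2, highest first
print(first_n(non_empty, 1))
```

Because nothing here infers a type for a whole column, the mixed values in column c pass through untouched.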
shenkwen,
If you are comfortable with adding a Google Apps Script to your sheet to give you a custom function, I have a QUERY replacement function that supports all standard SQL SELECT syntax. It does not analyze the column data and force the column to whichever type is most common, so this is not an issue.
The custom function code is a single file, available at:
https://github.com/demmings/gsSQL/tree/main/dist
After you save, you have a new function from your sheet. In your example, the syntax would be
=gsSQL("select a,b,c from testTable", {{"testTable", "F150:H152", 60, true}})
If your data is on a separate tab called 'testTable' (or whatever name you choose), the second parameter is not required.
I have typed in your example data into my test sheet (see line 150)
https://docs.google.com/spreadsheets/d/1Zmyk7a7u0xvICrxen-c0CdpssrLTkHwYx6XL00Tb1ws/edit?usp=sharing

Compare Dynamic Lists Power BI

I have a table ("Issues") which I am creating in PowerBI from a JIRA data connector, so this changes each time I refresh it. I have three columns I am using
Form Name
Effort
Status
I created a second table and have summarized the Form Names and obtained the Total Effort:
SUMMARIZE(Issues,Issues[Form Name],"Total Effort",SUM(Issues[Effort (Days)]))
But I also want to add in a column for
Total Effort for each form name where the Status field is "Done"
My issue is that I don't know how to compare both tables / form names since these might change each time I refresh the table.
I need to write a conditional, something like
For each form name, print the total effort, and also print the total effort where the status is Done
I have tried SUMX, CALCULATE, SUM, FILTER but cannot get these to work - can someone help, please?
If all you need is to add a column to your summarized table that sums "Effort" only when the Status is set to 'Done' -- then this is the right place to use CALCULATE.
Table =
SUMMARIZE(
    Issues,
    Issues[Form Name],
    "Total Effort", SUM(Issues[Effort]),
    "Total Effort (Done)", CALCULATE(SUM(Issues[Effort]), Issues[Status] = "Done")
)
Here is a quick capture of what some of the mock data that I used to test this looks like. The Matrix is just the mock data with [Form Name] on the rows and [Status] on the columns. The last table shows the 'summarized' data calculated by the DAX above. You can compare this to the values in the matrix and see that they tie out.
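For readers who want to check the arithmetic by hand, the same grouping can be sketched in Python. The issue rows below are invented mock data (not the data in the capture), and summarize is a hypothetical helper that mirrors the two measures: a plain sum per form name, and a sum restricted to rows whose Status is "Done".

```python
from collections import defaultdict

# Invented mock rows; column names follow the DAX above.
issues = [
    {"Form Name": "Form A", "Effort": 3, "Status": "Done"},
    {"Form Name": "Form A", "Effort": 2, "Status": "In Progress"},
    {"Form Name": "Form B", "Effort": 5, "Status": "Done"},
    {"Form Name": "Form B", "Effort": 1, "Status": "Done"},
    {"Form Name": "Form C", "Effort": 4, "Status": "To Do"},
]


def summarize(rows):
    """Group by form name and accumulate total effort and Done-only effort."""
    totals = defaultdict(lambda: {"Total Effort": 0, "Total Effort (Done)": 0})
    for row in rows:
        entry = totals[row["Form Name"]]
        entry["Total Effort"] += row["Effort"]
        if row["Status"] == "Done":
            entry["Total Effort (Done)"] += row["Effort"]
    return dict(totals)


summary = summarize(issues)
for form_name, values in summary.items():
    print(form_name, values)
```

One difference to keep in mind: this sketch reports 0 for a form with no Done rows, whereas the DAX measure returns a blank in that case. Because the grouping is recomputed from whatever rows are present, it does not matter that the form names change on each refresh.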

Query referencing 20 sheets / Indirect error with multiple ranges

I have 20 sheets (Eagle, Kestrel etc.) and want to reference the whole group of them in different queries.
To stop the query formula text becoming massive I have tried to use the Indirect function, but it looks like Indirect may not be able to return multiple ranges.
Example for just 2 sheets:
Query({Indirect(A1)}) where A1 contains the text Eagle!F3:I33;Kestrel!F3:I33
gives Indirect error "not a valid cell/range reference".
The 2 formulas below work OK but become unwieldy when referencing 20 sheets.
Query({Eagle!F3:I33;Kestrel!F3:I33})
Query({indirect(A2); indirect(A3)}) where A2 is Eagle!F3:I33 and A3 is Kestrel!F3:I33
Suggestions please (no script).
Challenge 2: how to include the sheet name (bird) in Col1 of the query output. The sheet name (bird) is written in cell A1 of each sheet.
Here is the solution that I settled on.
Problem summary
Challenge 1: Avoid oversized query formula when referencing many sheets/tabs.
Challenge 2: Return sheet name as part of the query output.
Key information
Script is not an option as it causes access and performance issues for users in my organisation.
Indirect function cannot pull multiple ranges into a Query.
There is not a function that returns sheet names (except within Script).
I started with a static list of sheet names.
Each sheet contains Name and Total data, but needs to be tagged with sheet name to identify it in output of query. Each sheet also included the sheet name in cell A1 (but not used in solution).
Solutions
Solution to Challenge 1: Specify the unique sheet ranges & select statements within hidden helper columns then reference them in the query.
Solution to Challenge 2: Insert sheet name as text within each select statement.
=query(
{query({indirect(B4)},C4);query({indirect(B5)},C5);
query({indirect(B6)},C6);query({indirect(B7)},C7);
query({indirect(B8)},C8);query({indirect(B9)},C9);
query({indirect(B10)},C10);query({indirect(B11)},C11);
query({indirect(B12)},C12);query({indirect(B13)},C13);
query({indirect(B14)},C14);query({indirect(B15)},C15);
query({indirect(B16)},C16);query({indirect(B17)},C17);
query({indirect(B18)},C18);query({indirect(B19)},C19);
query({indirect(B20)},C20);query({indirect(B21)},C21);
query({indirect(B22)},C22);query({indirect(B23)},C23)}
,"where Col3 >="&F2 &B2 ,0)
[Screenshot: helper columns and output]
Cells F2 & B2 are user defined. F2 is the minimum value to return. B2 relates to ordering of output.
B2 creates an extra bit of text for select statement, depending on user defined dropdown in E2.
=if(E2="order by lap count"," order by Col3 desc",)
The ,0 at the end of the final wraparound query is the optional query header row clause. Zero tells query that the input data has no headers. Necessary for this query.
The curly brackets inside each sheet query convert column names F, G, H to Col1, Col2, Col3.
The curly brackets and semicolons in the final wraparound query combine the sheet query outputs into an array, one underneath the other.
Top Tip: When referencing multiple sheets/tabs in a query, it is better to create a wraparound query (as above) to filter the output. This is because if you were to filter the individual sheet queries and one of them returned no data, the curly brackets in the wraparound query would return an array error.
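The shape of this solution (tag each sheet's rows with the sheet name, stack them, then filter and order once in the outer step) can be sketched in Python. The sheet contents below are invented, and tag_rows and wraparound are hypothetical helper names; Eagle and Kestrel are the only names taken from the question.

```python
# Invented per-sheet data: each row is [Name, Total].
sheets = {
    "Eagle": [["Ann", 12], ["Bob", 3]],
    "Kestrel": [["Cy", 8], ["Dee", 5]],
}


def tag_rows(sheet_name, rows):
    """Inner query: prepend the sheet name so each row becomes [Col1, Col2, Col3]."""
    return [[sheet_name, name, total] for name, total in rows]


def wraparound(stacked, minimum, order_by_total):
    """Outer query: filter on Col3 >= minimum, optionally order by Col3 descending."""
    kept = [row for row in stacked if row[2] >= minimum]
    if order_by_total:
        kept = sorted(kept, key=lambda row: row[2], reverse=True)
    return kept


# Stack the tagged rows one under the other, like the semicolons in the array literal.
stacked = [row for name, rows in sheets.items() for row in tag_rows(name, rows)]

print(wraparound(stacked, minimum=5, order_by_total=True))
print(wraparound(stacked, minimum=5, order_by_total=False))
```

Filtering only in the outer step matches the Top Tip above: every inner piece always contributes its full set of rows, so the stacking never has to cope with a piece that came back empty.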

Power query - strategy for handling repeating rows

Given a report which has a table with repeated row headings, is there a good strategy for using Power Query/M to extract the data in a clean format?
For example the report available here, has an excel file (which at time of writing is pointing to August 2021):
https://www.opec.org/opec_web/static_files_project/media/downloads/publications/MOMR%20Appendix%20Tables%20(August%202021).xlsx
In this example:
we have the World demand table portion
and the Non-OPEC Liquids production portion
Both of these have the rows Americas/Europe/Asia Pacific, which makes them hard to distinguish in Power Query.
What is the right approach for extracting data from this type of table?
I would add a custom column (Add Column > Custom Column) with the formula
=if [2018] = null then [Column] else null
and then right-click the new column and fill down.
That would put World Demand and Non-OPEC in a column that you could additionally filter on.
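The custom column plus fill-down step can be modelled in a few lines of Python. This is a sketch of the logic only: the rows and numbers are placeholders rather than figures from the OPEC workbook, and add_section is a hypothetical helper. A row whose 2018 value is null is treated as a section heading, and its label is carried down to the rows beneath it.

```python
# Placeholder rows: headings have no 2018 value; data rows do. Numbers are invented.
rows = [
    {"Column": "World demand", "2018": None},
    {"Column": "Americas", "2018": 1.0},
    {"Column": "Europe", "2018": 2.0},
    {"Column": "Non-OPEC liquids production", "2018": None},
    {"Column": "Americas", "2018": 3.0},
]


def add_section(table, label_col="Column", value_col="2018"):
    """Mark heading rows (custom column), then carry the last heading down (fill down)."""
    current = None
    result = []
    for row in table:
        marker = row[label_col] if row[value_col] is None else None
        if marker is not None:
            current = marker
        result.append(dict(row, Section=current))
    return result


tagged = add_section(rows)
for row in tagged:
    print(row["Section"], "|", row["Column"], "|", row["2018"])
```

After this step the two Americas rows are distinguishable by their Section value, so the heading rows themselves can be filtered out if they are no longer needed.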

Conditional Mapping in Talend

I have created a simple job in Talend that will perform an inner join in the data between 2 excel sheets and then dump the result in an output excel sheet. This can be best illustrated by the below diagram :-
The mapping used in tMap is :-
However, the additional challenge for me now is that I have to perform this mapping only if the column value in that row is not NULL, e.g. there is a mapping row1.RECID = row2.RECID, but this should only apply if row2.RECID is not NULL.
How do I achieve this in Talend? I have experimented a lot with tMap expressions but can't get it right..
Here is a small sample input and its corresponding expected output.
Suppose my input has values :-
v1, v2,v3,v4
1 , A, O, 3
2, B, X, 4
3, C, X, 4
and lookup has values
v1, v2, v3
1, A, O, 3
2, null, X, 4
3, null, C, 4
2,null,X,null
Then the output should be :-
v1,v2,v3
1,A,O,3
2,B,X,4
2,B,X,4
Before joining your input flows, you have to reject rows with null values. I have created a mapping based on the given sample data.
Try to map as many values as possible from row1, then add row2 with a left outer join.
If you want only the values which are in both row1 and row2, you can add a filter on row2 for that (but I guess that this is not what you want).
Talend does have a more elegant option that will allow the filtering of your data on multiple columns. Use the tSchemaComplianceCheck component where filtering out nulls and empty is as simple as clicking a couple of check boxes. This allows you to use your own schema to check against nulls and empty values and filter them out. The error rows go to a reject flow which you have the option of processing. If you do not wish to capture and process the rejects you can simply ignore them. Your main flow will only have the records that passed the compliance check. Here are some tips on using it:
In the tSchemaComplianceCheck component, on the Basic settings screen, click Custom defined and it will show you each column. Make sure Nullable is unchecked or else it will allow nulls to pass through.
In the Advanced settings tab, check Treat all empty string as NULL. This works in conjunction with the prior step to filter out both null and empty values.
In your Excel component, click the Advanced settings tab and check Stop reading on encountering empty rows.
Below is a screenshot which shows the basic flow and settings. You would link to a tMap instead of the tLogRow. If I have understood your problem correctly, I think you will find this is the ideal solution in Talend.
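The flow described here (reject rows with null or empty key values first, then join only what passed) can be sketched in Python. This is a model of the logic, not Talend code: compliance_check and inner_join are hypothetical helpers, and the rows are a cut-down illustration keyed on RECID rather than the exact sample from the question.

```python
def compliance_check(rows, required):
    """Split rows into (passed, rejects); None or empty string in a required column rejects."""
    passed, rejects = [], []
    for row in rows:
        bad = any(row.get(col) is None or row.get(col) == "" for col in required)
        (rejects if bad else passed).append(row)
    return passed, rejects


def inner_join(main, lookup, key):
    """Pair each main row with every lookup row that has the same key value."""
    index = {}
    for row in lookup:
        index.setdefault(row[key], []).append(row)
    return [(m, l) for m in main for l in index.get(m[key], [])]


# Illustrative rows only.
main = [
    {"RECID": 1, "v2": "A"},
    {"RECID": 2, "v2": "B"},
    {"RECID": 3, "v2": "C"},
]
lookup = [
    {"RECID": 1, "v3": "O"},
    {"RECID": None, "v3": "X"},
    {"RECID": "", "v3": "X"},
]

passed, rejects = compliance_check(lookup, required=["RECID"])
joined = inner_join(main, passed, key="RECID")
print(joined)
print(rejects)
```

The rejects list plays the role of the reject flow: it can be logged, written elsewhere, or simply ignored, while only the rows that passed the check take part in the join.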
