Transformations with desired values - etl

I'm trying a process with powercenter designer but I do not get the desired objective.
I have these initial data:
CODE CODE2 OPTION
001 A 89
001 A 55
001 A 12
002 B 25
002 A 59
025 A 44
I have to get it for code to do the following: if there are several records per CODE then you have to put the value 1111 in the OPTION2 field to the record with the highest value in the OPTION field, if there is only one record in CODE it also puts the value 1111. I do this by making an SORTER transformation in powercenter, not complicated.
What I can not do is the next step. The second record with the highest value in the OPTION field corresponds to the value of the first field and so on.
OUTPUT:
CODE CODE2 OPTION OPTION2
001 A 89 111111
001 A 55 89
001 A 12 55
002 A 59 111111
002 B 25 59
025 A 44 111111
How could I get this?
What transformations should I use?
Thanks! ^^

You can sort it by code and descending order of option. Then in an expression variable hold the value for the previous record's value in a variable.
v_OPTION2 = IIF(ISNULL(v_PREV_CODE) OR CODE != v_PREV_CODE,
111111,
v_PREV_OPTION
)
out_OPTION2 = v_OPTION2
v_PREV_OPTION = OPTION
v_PREV_CODE = CODE

Related

Count number of connections in each network using DAX

I have a dataset like this in Power BI with connections between "Participant ID" Column and "Knows Participant":
Participant ID
Knows Participant
111
353
111
777
111
112
111
249
112
143
112
144
113
111
113
244
114
NaN
115
113
...
...
777
111
777
398
777
114
778
NaN
779
112
3499
NaN
I've build Network chart. However, there are a lot of 1-1 connections that are not very useful for visualization, so I want to exclude them (see image):
Is it possible to count a number of connections in each network using DAX and then use this value to filter out all nodes with only 1 connection (red circled)? Or maybe filter out 1 connection nodes using another approach?
I've tried to make a calculated column using DAX:
Connection Column = COUNTROWS(
FILTER(Table,
EARLIER(Table[Knows Participant])=Table[Knows Participant])
)
However, it only shows duplicate values in "Knows Participant" Column, but not number of connections in each network.
Example of desired output:
Participant ID
Knows Participant
Number of Connections in the Network
111
353
4
353
444
4
444
551
4
551
987
4
112
143
1
220
190
1
333
337
2
337
410
2
765
0
You need the PATH functions as you're essentially trying to flatten a hierarchy and then exclude certain parts of it. The following help page gives a good rundown of the approach to take.
https://learn.microsoft.com/en-us/dax/understanding-functions-for-parent-child-hierarchies-in-dax
You can add a column to the table with a measure like this:
VAR pIdLinksCount = CALCULATE(COUNTROWS(tbl), ALL('tbl'[Knows Participant]))
VAR neighbourLinksCount =
IF(
pIdLinksCount=1
, -- if pIdLinksCount=1 then count neighbour links
VAR neighbourId =
CALCULATETABLE(
Values('tbl'[Knows Participant])
)
RETURN
CALCULATE(
COUNTROWS(tbl)
,ALL() -- removes all filters from data model
,'tbl'[Participant ID] = neighbourId -- applies filter to [Participant ID] column
--,'tbl'[Participant ID] IN neighbourId -- alternatively try this. I believe it is not necessary
)
,2 -- returns 2 if pIdLinksCount>1.
-- The "value = 2" will return "result > 3 = TRUE()"
)
VAR result = pIdLinksCount + neighbourLinksCount
RETURN
IF(
result>2
,1
,0
)
The idea is to check a neighbor too - if it has more then 1 link

removing bad data from a data file using pig

I have a data file like this
1943 49 1
1975 91 L
1903 56 3
1909 52 3
1953 96 3
1912 82
1976 66 3
1913 35
1990 45 1
1927 92 A
1912 2
1924 22
1971 2
1959 94 E
now using pig script I want to remove the bad data like removing those rows which have characters and empty fields
I tried this way
records = load '/user/a106524609/test.txt' using PigStorage(' ') as
(year:chararray, temperature:int, quality:int);
rec1 = filter records by temperature != 'null' and (quality != 'null ')
Load it as lines
A = load 'data.txt' using PigStorage('\n') as (line:chararray);
Split on all whitespaces
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\s+')) as (year:int,temp:int,quality:chararray);
Filter by valid strings
C = FILTER B BY quality IN ('0','1','2','3','4','5','6','7','8','9');
(Optionally) Cast to an int
D = FOREACH C GENERATE year,temp,(int)quality;
In Spark, I would start with a regex match of the expected format.
val cleanRows = sc.textFile("data.txt")
.filter(line => line.matches("(?:\\d+\\s+){2}\\d+"))

Different Code 128 barcode symbols representing the same data

I'm currently using software called LineView. It generates downtime reason codes for our factory lines. An operator scans the barcodes with an RS232 scanner and it goes into our XL board system.
The software itself generates the barcodes within an internet browser, but I am trying to make it so our own labeling machine can also print out the barcodes. However, the barcodes that are produced by the labeler (and the many online barcode generators I've tried) look longer and do not work.
The data for the example 128 barcode that I am trying to replicate is [SOH]1[STX]65;1067[ETX].
According to the manual:
- The Start of Header character (ASCII 0x01) starts the XL Command packet.
1 - The Serial Address of the XL device (the default is 1).
- The Start of Transmission character (ASCII 0x02) marks the start of the actual command.
65; - The ID of the Production State > Set Reason Code command.
The Reason Code ID (which can range from 1 to 999 for system reasons or 1000 to 1999 for user defined reasons). In my case it is 1067
- The End of Transmission character (ASCII 0x03) ends the XL Command packet.
I have attatched the pictures of what LineView produces (which is what I want it to look like) and what it is currently printing like on our labeller.
When I scan them they both come up with the [SOH]1[STX]65;1067[ETX] code despite them looking different.
Any help with this would be very much appreciated.
Your intended barcode is constructed internally using the following series of Code 128 codewords which correctly represent the ASCII control characters:
103 Start-in-Mode-A (Upper-case and control characters)
65 [SOH] (ASCII 1)
17 1
66 [STX] (ASCII 2)
22 6
21 5
27 ;
99 Switch-to-Mode-C (Double-density numeric)
10 10
67 67
101 Switch-to-Mode-A
67 [ETX] (ASCII 3)
67 Check-digit
106 Stop
Your label printer is printing a barcode representing the literal string [SOH]1[STX]65;1067[ETX] with no ASCII control characters (i.e. left-bracket, S, O, H, right-bracket, ...) using the following internal codewords:
104 Start-in-Mode-B (Mixed-case)
59 [
51 S
47 O
40 H
61 ]
17 1
59 [
51 S
52 T
56 X
61 ]
22 6
21 5
27 ;
99 Switch-to-Mode-C (Double-density numeric)
10 10
67 67
100 Switch-to-Mode-B
59 [
37 E
52 T
56 X
61 ]
57 Check-digit
106 Stop
So you need to work out how to correctly specify ASCII control characters in the input to your labelling machine.

Pig Scripting - Cast STRING to INT

Beginner in Pig, Need help
For all NON - AlphaNumeric, Cast the STRING TO INT
- To be handled without passing each field name separately.
Sample data -
00013425731998101620140402300032736901 00000000AAA001200X111685V00000000
00283335542006120920131010300030003105 00000000AAA001200X117407 00000000
00000000331998101620140402300033128107 00000000AAA001200X111685 00000000
00003902331999090620140402300032545208 00000000AAA001200X111685 00000000
Its a fixedwidth file, mapping details as follow -
orderNumber 1 9
origin 10 10
Startdate 11 18
ModDate 19 26
Identifier 27 36
Code 37 38
CodeType 39 40
Number 41 48
Num 49 114
Either use substr to extract the parts and then cast them or use a regexp. For example for the first two fields:
input = load ... as (line:chararray);
a = foreach input generate SUBSTRING(line, 0, 9) as orderNumber:long, SUBSTRING(line, 9, 10) as origin:chararray;
This way you should be able to convert each part of the input line into the desired components.
Alternatively you could write a UDF that takes a string as input and does the splitting and returns a bag or tuple.

PIG - retrieve data from XML using XPATH

I have n number of these type of xml files.
<students roll_no=1>
<name>abc</name>
<gender>m</gender>
<maxmarks>
<marks>
<year>2014</year>
<maths>100</maths>
<english>100</english>
<spanish>100</spanish>
<marks>
<marks>
<year>2015</year>
<maths>110</maths>
<english>110</english>
<spanish>110</spanish>
<marks>
</maxmarks>
<marksobt>
<marks>
<year>2014</year>
<maths>90</maths>
<english>95</english>
<spanish>82</spanish>
<marks>
<marks>
<year>2015</year>
<maths>94</maths>
<english>98</english>
<spanish>02</spanish>
<marks>
</marksobt>
</Students>
I need output like
roll_no name gender year eng_max_marks maths_max_marks spanish_max_marks
1 abc m 2014 100 100 100
1 abc m 2015 110 110 110
I am able to retrieve marks row wise in single statement but not able to extract roll_no and name with this.
A = LOAD 'student.xml' using org.apache.pig.piggybank.storage.XMLLoader('marks') as (x:chararray);
B = FOREACH A GENERATE XPath(x, 'marks/year'), XPath(x, 'marks/english'), XPath(x, 'marks/math'), XPath(x, 'marks/spanish');
This return
year eng_max_marks maths_max_marks spanish_max_marks
2014 100 100 100
2015 110 110 110
I can extract both the chunks but not getting how to join other fields. I can't use across join because I have n number of other files.
Let's forger attribute name (roll_no) for now. How can I extract the rest of nodes
name gender year eng_max_marks maths_max_marks spanish_max_marks
abc m 2014 100 100 100
abc m 2015 110 110 110
I don't want to use marks(1)/english approach because this nodes can also vary and don't want to adopt any dirty approach.
Any pointers????

Resources