Count number of connections in each network using DAX - filter

I have a dataset like this in Power BI, with connections between the "Participant ID" and "Knows Participant" columns:
Participant ID    Knows Participant
111               353
111               777
111               112
111               249
112               143
112               144
113               111
113               244
114               NaN
115               113
...               ...
777               111
777               398
777               114
778               NaN
779               112
3499              NaN
I've built a network chart. However, there are a lot of 1-1 connections that are not very useful for visualization, so I want to exclude them (see image).
Is it possible to count the number of connections in each network using DAX and then use this value to filter out all nodes with only one connection (circled in red)? Or maybe filter out one-connection nodes using another approach?
I've tried to make a calculated column using DAX:
Connection Column =
COUNTROWS(
    FILTER(
        Table,
        EARLIER(Table[Knows Participant]) = Table[Knows Participant]
    )
)
However, it only counts duplicate values in the "Knows Participant" column; it does not give the number of connections in each network.
Example of desired output:
Participant ID    Knows Participant    Number of Connections in the Network
111               353                  4
353               444                  4
444               551                  4
551               987                  4
112               143                  1
220               190                  1
333               337                  2
337               410                  2
765                                    0

You need the PATH functions as you're essentially trying to flatten a hierarchy and then exclude certain parts of it. The following help page gives a good rundown of the approach to take.
https://learn.microsoft.com/en-us/dax/understanding-functions-for-parent-child-hierarchies-in-dax
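
Note that the PATH functions assume a strict parent-child hierarchy (one parent per child), which a general network does not guarantee. If DAX gets unwieldy, the per-network edge count can also be computed before loading, e.g. as a Python preprocessing step (Power BI can run Python scripts at load time). A minimal union-find sketch, with toy data shaped like the question's table (the table and column names are assumptions, not part of the original):

import pandas as pd

# Toy edge list shaped like the question's data; NaN means no connection
df = pd.DataFrame({
    "Participant ID":    [111, 353, 444, 551, 112, 765],
    "Knows Participant": [353, 444, 551, 987, 143, None],
})

parent = {}

def find(x):
    # Union-find lookup with path halving
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

edges = df.dropna(subset=["Knows Participant"])
for a, b in zip(edges["Participant ID"], edges["Knows Participant"].astype(int)):
    union(a, b)

# Count edges per component root; isolated participants get 0
edges_per_component = edges["Participant ID"].map(find).value_counts()
df["Connections in Network"] = df["Participant ID"].map(
    lambda p: edges_per_component.get(find(p), 0)
)
print(df)  # the 111-353-444-551-987 chain gets 4, 112 gets 1, 765 gets 0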

You can add a calculated column to the table with an expression like this:
VAR pIdLinksCount =
    CALCULATE( COUNTROWS( tbl ), ALL( 'tbl'[Knows Participant] ) )
VAR neighbourLinksCount =
    IF(
        pIdLinksCount = 1,
        -- if pIdLinksCount = 1, count the neighbour's links too
        VAR neighbourId =
            CALCULATETABLE( VALUES( 'tbl'[Knows Participant] ) )
        RETURN
            CALCULATE(
                COUNTROWS( tbl ),
                ALL(),  -- removes all filters from the data model
                'tbl'[Participant ID] = neighbourId  -- applies a filter to the [Participant ID] column
                -- ,'tbl'[Participant ID] IN neighbourId  -- alternatively try this; I believe it is not necessary
            ),
        2  -- returns 2 if pIdLinksCount > 1, which guarantees result > 2 below
    )
VAR result = pIdLinksCount + neighbourLinksCount
RETURN
    IF(
        result > 2,
        1,
        0
    )
The idea is to check the neighbour as well, to see whether it has more than one link. The resulting 1/0 column can then be used as a filter on the network visual (keep only 1) to hide the one-connection pairs.

Related

Why is my DXF file not working in AutoCAD, giving me "ID 11 incorrect: already used"?

I have generated a DXF file, but when I open it with AutoCAD, it crashes AutoCAD and gives the message "ID 11 incorrect: already used".
The DXF content: https://github.com/tarikjabiri/dxf/blob/dev/examples/latest.dxf
I can't spot the problem; I have been trying to solve it for 3 days.
I think something is wrong with the APPID, because it is holding the ID 11, or the "handle" in DXF terms.
I have a working DXF for comparison: https://github.com/tarikjabiri/dxf/blob/dev/examples/Minimal_DXF_AC1021.dxf
Thanks in advance.
There are two minor issues:
Issue 1: the DIMSTYLE table
0
TABLE
2
DIMSTYLE
105 <<< wrong: the handle group code of the table "head" should be 5, as usual
8
100
AcDbSymbolTable
100
AcDbDimStyleTable
70
1
0
DIMSTYLE
5 <<< wrong: the handle group code of a DIMSTYLE table entry should be 105
12
330
8
100
AcDbSymbolTableRecord
100
AcDbDimStyleTableRecord
2
STANDARD
70
0
40
1
Issue 2: the BLOCK_RECORD table entries for *MODEL_SPACE and *PAPER_SPACE
0
TABLE
2
BLOCK_RECORD
5
9
330
0
100
AcDbSymbolTable
70
2
0
BLOCK_RECORD
5
14
330
9
100
AcDbSymbolTableRecord
100
AcDbRegAppTableRecord <<< wrong: the subclass marker string should be "AcDbBlockTableRecord"
2
*MODEL_SPACE
70
0
70
0
280
After these changes the file opens in Autodesk DWG TrueView 2022.
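
To hunt for this kind of mismatch, it can help to dump every handle tag in the file. A rough Python sketch, assuming an ASCII DXF where tags come as alternating group-code/value line pairs (latest.dxf is the file linked in the question):

# Scan an ASCII DXF for handle group codes (5 and 105).
# DIMSTYLE is the only table whose entries use 105; the table
# head itself must use 5, which is exactly what was swapped here.
def dxf_tags(path):
    with open(path) as f:
        raw = [line.strip() for line in f]
    return list(zip(raw[0::2], raw[1::2]))  # (group code, value) pairs

for i, (code, value) in enumerate(dxf_tags("latest.dxf")):
    if code in ("5", "105"):
        print(f"pair {i}: group {code} -> handle {value}")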

Removing bad data from a data file using Pig

I have a data file like this
1943 49 1
1975 91 L
1903 56 3
1909 52 3
1953 96 3
1912 82
1976 66 3
1913 35
1990 45 1
1927 92 A
1912 2
1924 22
1971 2
1959 94 E
Now, using a Pig script, I want to remove the bad data: the rows that have characters or empty fields.
I tried this:
records = load '/user/a106524609/test.txt' using PigStorage(' ') as
(year:chararray, temperature:int, quality:int);
rec1 = filter records by temperature != 'null' and (quality != 'null ')
Load it as lines:
A = load 'data.txt' using PigStorage('\n') as (line:chararray);
Split on all whitespace:
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\s+')) as (year:int, temp:int, quality:chararray);
Filter by valid strings:
C = FILTER B BY quality IN ('0','1','2','3','4','5','6','7','8','9');
(Optionally) cast to an int:
D = FOREACH C GENERATE year, temp, (int)quality;
In Spark, I would start with a regex match of the expected format.
val cleanRows = sc.textFile("data.txt")
.filter(line => line.matches("(?:\\d+\\s+){2}\\d+"))
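
The same filter is easy to prototype in plain Python before wiring it into Pig or Spark; a minimal sketch, assuming the file is named data.txt as in the question:

import re

# Keep only rows of exactly three whitespace-separated integers
pattern = re.compile(r"^\d+\s+\d+\s+\d+$")

with open("data.txt") as f:
    for line in f:
        line = line.strip()
        if pattern.match(line):
            year, temp, quality = map(int, line.split())
            print(year, temp, quality)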

Transformations with desired values

I'm trying to build a process in PowerCenter Designer, but I am not getting the desired result.
I have this initial data:
CODE CODE2 OPTION
001 A 89
001 A 55
001 A 12
002 B 25
002 A 59
025 A 44
For each CODE I need the following: if there are several records for a CODE, the record with the highest value in the OPTION field gets the value 111111 in the OPTION2 field; if there is only one record for a CODE, it also gets 111111. I can do this with a Sorter transformation in PowerCenter; that part is not complicated.
What I cannot manage is the next step: each following record (in descending order of OPTION) should get the previous record's OPTION value in its OPTION2 field, and so on.
OUTPUT:
CODE CODE2 OPTION OPTION2
001 A 89 111111
001 A 55 89
001 A 12 55
002 A 59 111111
002 B 25 59
025 A 44 111111
How could I get this?
What transformations should I use?
Thanks! ^^
You can sort it by CODE and by descending OPTION. Then, in an Expression transformation, hold the previous record's values in variable ports:
v_OPTION2 = IIF(ISNULL(v_PREV_CODE) OR CODE != v_PREV_CODE,
                111111,
                v_PREV_OPTION)
out_OPTION2 = v_OPTION2
v_PREV_OPTION = OPTION
v_PREV_CODE = CODE
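
This relies on variable ports being evaluated top to bottom, so v_OPTION2 still sees the previous row's values before v_PREV_OPTION and v_PREV_CODE are overwritten. For intuition, here is the same logic as a small Python sketch (sample data taken from the question):

# Sort by CODE, then by OPTION descending, as the Sorter transformation would
rows = [("001", "A", 89), ("001", "A", 55), ("001", "A", 12),
        ("002", "B", 25), ("002", "A", 59), ("025", "A", 44)]
rows.sort(key=lambda r: (r[0], -r[2]))

prev_code, prev_option = None, None
for code, code2, option in rows:
    # First record of each CODE gets 111111, the rest get the previous OPTION
    option2 = 111111 if code != prev_code else prev_option
    print(code, code2, option, option2)
    prev_code, prev_option = code, option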

PIG - retrieve data from XML using XPATH

I have n XML files of this type:
<students roll_no="1">
  <name>abc</name>
  <gender>m</gender>
  <maxmarks>
    <marks>
      <year>2014</year>
      <maths>100</maths>
      <english>100</english>
      <spanish>100</spanish>
    </marks>
    <marks>
      <year>2015</year>
      <maths>110</maths>
      <english>110</english>
      <spanish>110</spanish>
    </marks>
  </maxmarks>
  <marksobt>
    <marks>
      <year>2014</year>
      <maths>90</maths>
      <english>95</english>
      <spanish>82</spanish>
    </marks>
    <marks>
      <year>2015</year>
      <maths>94</maths>
      <english>98</english>
      <spanish>02</spanish>
    </marks>
  </marksobt>
</students>
I need output like
roll_no name gender year eng_max_marks maths_max_marks spanish_max_marks
1 abc m 2014 100 100 100
1 abc m 2015 110 110 110
I am able to retrieve the marks row-wise in a single statement, but I am not able to extract roll_no and name along with them.
A = LOAD 'student.xml' USING org.apache.pig.piggybank.storage.XMLLoader('marks') AS (x:chararray);
B = FOREACH A GENERATE XPath(x, 'marks/year'), XPath(x, 'marks/english'), XPath(x, 'marks/maths'), XPath(x, 'marks/spanish');
This returns:
year eng_max_marks maths_max_marks spanish_max_marks
2014 100 100 100
2015 110 110 110
I can extract both chunks, but I don't see how to join the other fields to them. I can't use a join across them, because I have n other files.
Let's forget the attribute (roll_no) for now. How can I extract the rest of the nodes?
name gender year eng_max_marks maths_max_marks spanish_max_marks
abc m 2014 100 100 100
abc m 2015 110 110 110
I don't want to use the marks(1)/english approach, because these nodes can vary, and I don't want to adopt any dirty approach.
Any pointers?
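
For comparison, outside Pig this flattening is only a few lines with Python's xml.etree.ElementTree; a minimal sketch, assuming the corrected XML above is saved as student.xml:

import xml.etree.ElementTree as ET

root = ET.parse("student.xml").getroot()   # the <students> element
roll_no = root.get("roll_no")
name = root.findtext("name")
gender = root.findtext("gender")

# One output row per <marks> block under <maxmarks>
for marks in root.find("maxmarks").findall("marks"):
    print(roll_no, name, gender,
          marks.findtext("year"),
          marks.findtext("english"),
          marks.findtext("maths"),
          marks.findtext("spanish"))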

Pandas performance issue of dataframe column "rename" and "drop"

Below is the line_profiler record of a function:
Wrote profile results to FM_CORE.py.lprof
Timer unit: 2.79365e-07 s
File: F:\FM_CORE.py
Function: _rpt_join at line 1068
Total time: 1.87766 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1068 @profile
1069 def _rpt_join(dfa, dfb, join_type='inner'):
1070 ''' join two dataframe together by ('STK_ID','RPT_Date') multilevel index.
1071 'join_type' can be 'inner' or 'outer'
1072 '''
1073
1074 2 56 28.0 0.0 try: # ('STK_ID','RPT_Date') are normal column
1075 2 2936668 1468334.0 43.7 rst = pd.merge(dfa, dfb, how=join_type, on=['STK_ID','RPT_Date'], left_index=True, right_index=True)
1076 except: # ('STK_ID','RPT_Date') are index
1077 rst = pd.merge(dfa, dfb, how=join_type, left_index=True, right_index=True)
1078
1079
1080 2 81 40.5 0.0 try: # handle 'STK_Name
1081 2 426472 213236.0 6.3 name_combine = pd.concat([dfa.STK_Name, dfb.STK_Name])
1082
1083
1084 2 900584 450292.0 13.4 nameseries = name_combine[-Series(name_combine.index.values, name_combine.index).duplicated()]
1085
1086 2 1138140 569070.0 16.9 rst.STK_Name_x = nameseries
1087 2 596768 298384.0 8.9 rst = rst.rename(columns={'STK_Name_x': 'STK_Name'})
1088 2 722293 361146.5 10.7 rst = rst.drop(['STK_Name_y'], axis=1)
1089 except:
1090 pass
1091
1092 2 94 47.0 0.0 return rst
What surprises me are these two lines:
1087 2 596768 298384.0 8.9 rst = rst.rename(columns={'STK_Name_x': 'STK_Name'})
1088 2 722293 361146.5 10.7 rst = rst.drop(['STK_Name_y'], axis=1)
Why do simple DataFrame column "rename" and "drop" operations cost that much of the time (8.9% + 10.7%)? After all, the "merge" operation only costs 43.7%, and "rename"/"drop" do not look like calculation-intensive operations. How can I improve this?
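
rename and drop each return a new DataFrame by default, so each call copies the columns of the merged result; on a wide frame that copy is what shows up in the profile. A sketch of cheaper alternatives (whether they help depends on the pandas version, so treat this as something to measure, not a guarantee):

import pandas as pd

df = pd.DataFrame({"STK_Name_x": ["a"], "STK_Name_y": ["b"], "EPS": [1.0]})

# Rename by mutating the column index instead of copying the frame
df.columns = [c if c != "STK_Name_x" else "STK_Name" for c in df.columns]

# del drops a single column without rebuilding the whole frame
del df["STK_Name_y"]

print(df)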
