removing bad data from a data file using pig - hadoop

I have a data file like this
1943 49 1
1975 91 L
1903 56 3
1909 52 3
1953 96 3
1912 82
1976 66 3
1913 35
1990 45 1
1927 92 A
1912 2
1924 22
1971 2
1959 94 E
Using a Pig script, I want to remove the bad data, that is, drop the rows that contain letters or have empty fields.
I tried this:
records = load '/user/a106524609/test.txt' using PigStorage(' ') as
(year:chararray, temperature:int, quality:int);
rec1 = filter records by temperature != 'null' and (quality != 'null ')

Load it as lines
A = load 'data.txt' using PigStorage('\n') as (line:chararray);
Split on whitespace
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\s+')) as (year:int,temp:int,quality:chararray);
Filter by valid strings
C = FILTER B BY quality IN ('0','1','2','3','4','5','6','7','8','9');
(Optionally) Cast to an int
D = FOREACH C GENERATE year,temp,(int)quality;
In Spark, I would start with a regex match of the expected format.
val cleanRows = sc.textFile("data.txt")
.filter(line => line.matches("(?:\\d+\\s+){2}\\d+"))
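The same regex-first idea can be sketched in plain Python, which may help when checking the pattern against the sample rows before running it on a cluster. This is only an illustration: the names `VALID_ROW` and `clean_rows` are made up here, and the pattern is copied from the Spark snippet above.

```python
import re

# Two numeric fields followed by a third, separated by whitespace,
# mirroring the regular expression used in the Spark snippet.
VALID_ROW = re.compile(r"(?:\d+\s+){2}\d+")

def clean_rows(lines):
    """Return [year, temperature, quality] integer triples for well-formed rows only."""
    rows = []
    for line in lines:
        line = line.strip()
        # fullmatch requires the whole line to fit the pattern, like String.matches in Scala/Java
        if VALID_ROW.fullmatch(line):
            rows.append([int(field) for field in line.split()])
    return rows

sample = ["1943 49 1", "1975 91 L", "1912 82", "1927 92 A", "1976 66 3"]
print(clean_rows(sample))
```

Rows with a letter in the quality column or with a missing field fail the full-line match and are dropped.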


Count number of connections in each network using DAX

I have a dataset like this in Power BI with connections between "Participant ID" Column and "Knows Participant":
Participant ID    Knows Participant
111               353
111               777
111               112
111               249
112               143
112               144
113               111
113               244
114               NaN
115               113
...               ...
777               111
777               398
777               114
778               NaN
779               112
3499              NaN
I've built a network chart. However, there are a lot of 1-1 connections that are not very useful for visualization, so I want to exclude them.
Is it possible to count the number of connections in each network using DAX and then use this value to filter out all nodes with only 1 connection? Or can the 1-connection nodes be filtered out using another approach?
I've tried to make a calculated column using DAX:
Connection Column = COUNTROWS(
FILTER(Table,
EARLIER(Table[Knows Participant])=Table[Knows Participant])
)
However, it only counts duplicate values in the "Knows Participant" column, not the number of connections in each network.
Example of desired output:
Participant ID    Knows Participant    Number of Connections in the Network
111               353                  4
353               444                  4
444               551                  4
551               987                  4
112               143                  1
220               190                  1
333               337                  2
337               410                  2
765                                    0
You need the PATH functions as you're essentially trying to flatten a hierarchy and then exclude certain parts of it. The following help page gives a good rundown of the approach to take.
https://learn.microsoft.com/en-us/dax/understanding-functions-for-parent-child-hierarchies-in-dax
You can add a column to the table with a measure like this:
VAR pIdLinksCount = CALCULATE(COUNTROWS(tbl), ALL('tbl'[Knows Participant]))
VAR neighbourLinksCount =
IF(
pIdLinksCount=1
, -- if pIdLinksCount=1 then count neighbour links
VAR neighbourId =
CALCULATETABLE(
Values('tbl'[Knows Participant])
)
RETURN
CALCULATE(
COUNTROWS(tbl)
,ALL() -- removes all filters from data model
,'tbl'[Participant ID] = neighbourId -- applies filter to [Participant ID] column
--,'tbl'[Participant ID] IN neighbourId -- alternatively try this. I believe it is not necessary
)
,2 -- returns 2 if pIdLinksCount>1,
-- which guarantees that "result > 2" below is TRUE()
)
VAR result = pIdLinksCount + neighbourLinksCount
RETURN
IF(
result>2
,1
,0
)
The idea is to check the neighbor too, to see whether it has more than 1 link.
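If the edge list can be pre-processed outside DAX (an assumption; the thread itself stays inside Power BI), the "number of connections in each network" from the desired output can be computed as the number of links per connected component. The sketch below uses a small union-find; the function name `connections_per_network` and the `isolated` parameter are illustrative, not part of any Power BI API.

```python
from collections import defaultdict

def connections_per_network(edges, isolated=()):
    """Map each participant to the number of links in its connected network."""
    parent = {}

    def find(node):
        # Follow parent pointers up to the root, halving the path on the way.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edges:
        parent[find(a)] = find(b)   # merge the two networks
    for node in isolated:
        find(node)                  # register participants that have no links

    links = defaultdict(int)
    for a, _ in edges:
        links[find(a)] += 1         # count each link once, at its network's root
    return {node: links[find(node)] for node in parent}

# Rows taken from the "desired output" table in the question.
edges = [(111, 353), (353, 444), (444, 551), (551, 987),
         (112, 143), (220, 190), (333, 337), (337, 410)]
counts = connections_per_network(edges, isolated=[765])
print(counts[111], counts[112], counts[333], counts[765])
```

The resulting per-participant count could then be loaded as an extra column and used as a visual filter, for example keeping only rows where the count is greater than 1.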

SNMP - Decode Hex String Value

This is my first question here, so I hope it's done correctly.
I'm trying to get some information from a ZTE C300 OLT.
When I try to get the SN of one of the ONTs, I get the response as a Hex-STRING:
snmpwalk -cpublic -v2c [OLTIP] 1.3.6.1.4.1.3902.1082.500.10.2.2.5.1.2
And this is the response that I get
SNMPv2-SMI::enterprises.3902.1082.500.10.2.2.5.1.2.285278736.1 = Hex-STRING: 5A 54 45 47 C8 79 9B 27
The SN shown on the OLT is ZTEGC8799B27, but when I try to convert the Hex-STRING into text I don't get that SN.
I also have a Python script for SNMP, and the response that I get for that OID is
{'1.3.6.1.4.1.3902.1082.500.10.2.2.5.1.2.285278736.1': "ZTEGÈy\x9b'"}
Can someone give me a hand with this? I'm new to SNMP and this is giving me a headache.
Thanks in advance!
This is an 8-octet hex string; the first 4 octets are ASCII.
Just convert hex to ASCII.
Indeed, it was easier. The first 4 bytes are ASCII-encoded, and the other 4 are the actual serial number, written as two hex digits per byte. So I only need to decode the first part and concatenate the rest.
Works with OLT ZTE C320
def hex_str(hex_string):
    """Turn '5A 54 45 47 C8 79 9B 27' into 'ZTEGC8799B27'."""
    hex_bytes = hex_string.strip().split(' ')
    # The first four octets are the ASCII vendor ID, e.g. 5A 54 45 47 -> ZTEG
    vendor_id = ''.join(chr(int(hex_byte, 16)) for hex_byte in hex_bytes[:4])
    # The remaining four octets are the serial number, kept as hex digits
    serial = ''.join(hex_bytes[4:])
    return vendor_id + serial

def ascii_to_hex(text):
    """Handle a value that the SNMP library already decoded into characters."""
    # Zero-padded, upper-case hex so that every byte is exactly two digits
    hex_string = ' '.join(format(ord(char), '02X') for char in text)
    return hex_str(hex_string)

# value = "5A 54 45 47 C8 79 9B 27 "
# value = "49 54 42 53 8B 69 A2 45 "
# value = "ZTEGÈy\x9b'"
value = "ITBS2Lz/"
# value = "ITBS2HP#"
# A Hex-STRING of 8 octets with a trailing space is 24 characters long
if len(value) == 24:
    print(hex_str(value))
else:
    print(ascii_to_hex(value))

Pig Scripting - Cast STRING to INT

I'm a beginner in Pig and need help.
For every field that is not alphanumeric, cast the STRING to INT.
- This should be handled without passing each field name separately.
Sample data:
00013425731998101620140402300032736901 00000000AAA001200X111685V00000000
00283335542006120920131010300030003105 00000000AAA001200X117407 00000000
00000000331998101620140402300033128107 00000000AAA001200X111685 00000000
00003902331999090620140402300032545208 00000000AAA001200X111685 00000000
It's a fixed-width file; the mapping details are as follows (field name, start position, end position):
orderNumber 1 9
origin 10 10
Startdate 11 18
ModDate 19 26
Identifier 27 36
Code 37 38
CodeType 39 40
Number 41 48
Num 49 114
Either use substr to extract the parts and then cast them or use a regexp. For example for the first two fields:
lines = load ... as (line:chararray);
a = foreach lines generate (long)SUBSTRING(line, 0, 9) as orderNumber, SUBSTRING(line, 9, 10) as origin;
This way you should be able to convert each part of the input line into the desired components.
Alternatively you could write a UDF that takes a string as input and does the splitting and returns a bag or tuple.
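To illustrate the "without passing each field name separately" part, here is a minimal Python sketch of the slicing logic such a UDF might implement. It is not Pig code; `LAYOUT` and `parse_line` are hypothetical names, the positions are copied from the mapping in the question, and the rule "cast only when the field is all digits" is an assumption about what the asker wants.

```python
# (name, start, end) with 1-based inclusive positions, copied from the question's mapping.
LAYOUT = [
    ("orderNumber", 1, 9), ("origin", 10, 10), ("Startdate", 11, 18),
    ("ModDate", 19, 26), ("Identifier", 27, 36), ("Code", 37, 38),
    ("CodeType", 39, 40), ("Number", 41, 48), ("Num", 49, 114),
]

def parse_line(line):
    """Slice one fixed-width line and cast purely numeric fields to int."""
    record = {}
    for name, start, end in LAYOUT:
        text = line[start - 1:end].strip()
        # Cast only when every character is an ASCII digit; keep mixed or empty fields as text.
        if text.isascii() and text.isdigit():
            record[name] = int(text)
        else:
            record[name] = text
    return record

sample = "00013425731998101620140402300032736901 00000000AAA001200X111685V00000000"
record = parse_line(sample)
print(record["orderNumber"], record["origin"], record["Startdate"])
```

Because the loop is driven by the layout table, adding or changing a field only means editing `LAYOUT`, not the casting logic.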

Calculation of application speedup using gnuplot and awk

Here's the problem:
Speedup formula: S(p) = T(1)/T(p) = (avg time for one process / avg time for p processes)
There are 5 logs, from which one wants to extract the information.
cg.B.1.log contains the execution times for one process, so we do the calculation of the average time to obtain T(1). The other log files contain the execution times for 2, 4, 8 and 16 processes. Averages of those times must also be calculated, since they are T(p).
Here's the code that calculates the averages:
tavg(n) = "awk 'BEGIN { FS = \"[ \\t]*=[ \\t]*\" } /Time in seconds/ { s += $2; c++ } /Total processes/ { if (! CP) CP = $2 } END { print s/c }' cg.B.".n.".log ".(n == 1 ? ">" : ">>")." tavg.dat;"
And the code that calculates the speedup:
system "awk 'NR==1{n=$0} {print n/$0}' tavg.dat > speedup.dat;"
How do I combine those two commands so that the output 'speedup.dat' is produced directly without using file tavg.dat?
Here are the contents of the files; all log files have the same structure. For brevity, I have included only the first two executions.
cg.B.1.log
-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-
Start in 16:45:15--25/12/2014
NAS Parallel Benchmarks 3.3 -- CG Benchmark
Size: 75000
Iterations: 75
Number of active processes: 1
Number of nonzeroes per row: 13
Eigenvalue shift: .600E+02
iteration ||r|| zeta
1 0.30354859861452E-12 59.9994751578754
2 0.11186435488267E-14 21.7627846142536
3 0.11312258511928E-14 22.2876617043224
4 0.11222160585284E-14 22.5230738188346
5 0.11244234177219E-14 22.6275390653892
6 0.11330434819384E-14 22.6740259189533
7 0.11334259623050E-14 22.6949056826251
8 0.11374839313647E-14 22.7044023166872
9 0.11424877443039E-14 22.7087834345620
10 0.11329475190566E-14 22.7108351397177
11 0.11337364093482E-14 22.7118107121341
12 0.11379928308864E-14 22.7122816240971
13 0.11369453681794E-14 22.7125122663243
14 0.11430390337015E-14 22.7126268007594
15 0.11400318886400E-14 22.7126844161819
16 0.11352091331197E-14 22.7127137461755
17 0.11350923439124E-14 22.7127288402000
18 0.11475378864565E-14 22.7127366848296
19 0.11366777929028E-14 22.7127407981217
20 0.11274243312504E-14 22.7127429721364
21 0.11353930792856E-14 22.7127441294025
22 0.11299685800278E-14 22.7127447493900
23 0.11296405041170E-14 22.7127450834533
24 0.11381975597887E-14 22.7127452643881
25 0.11328127301663E-14 22.7127453628451
26 0.11367332658939E-14 22.7127454166517
27 0.11283372178605E-14 22.7127454461696
28 0.11384734158863E-14 22.7127454624211
29 0.11394011989719E-14 22.7127454713974
30 0.11354294067640E-14 22.7127454763703
31 0.11412988029103E-14 22.7127454791343
32 0.11358088407717E-14 22.7127454806740
33 0.11263266152515E-14 22.7127454815316
34 0.11275183080286E-14 22.7127454820131
35 0.11328306951409E-14 22.7127454822840
36 0.11357880314891E-14 22.7127454824349
37 0.11332687790488E-14 22.7127454825202
38 0.11324108818137E-14 22.7127454825684
39 0.11365065523777E-14 22.7127454825967
40 0.11361185361321E-14 22.7127454826116
41 0.11276519820716E-14 22.7127454826202
42 0.11317183424878E-14 22.7127454826253
43 0.11236007481770E-14 22.7127454826276
44 0.11304065564684E-14 22.7127454826296
45 0.11287791356431E-14 22.7127454826310
46 0.11297028000133E-14 22.7127454826310
47 0.11281236869666E-14 22.7127454826314
48 0.11277254075548E-14 22.7127454826317
49 0.11320327289847E-14 22.7127454826309
50 0.11287655285563E-14 22.7127454826321
51 0.11230503422400E-14 22.7127454826324
52 0.11292089094944E-14 22.7127454826313
53 0.11366728396408E-14 22.7127454826315
54 0.11222618466968E-14 22.7127454826310
55 0.11278193276516E-14 22.7127454826315
56 0.11244624896030E-14 22.7127454826316
57 0.11264508872685E-14 22.7127454826318
58 0.11255583774760E-14 22.7127454826314
59 0.11227129146723E-14 22.7127454826314
60 0.11189480800173E-14 22.7127454826318
61 0.11163241472678E-14 22.7127454826315
62 0.11278839424218E-14 22.7127454826318
63 0.11226804133008E-14 22.7127454826313
64 0.11222456601361E-14 22.7127454826317
65 0.11270879524310E-14 22.7127454826308
66 0.11303771390006E-14 22.7127454826319
67 0.11240101357287E-14 22.7127454826319
68 0.11240278884391E-14 22.7127454826321
69 0.11207748067718E-14 22.7127454826317
70 0.11178755187571E-14 22.7127454826327
71 0.11195935245649E-14 22.7127454826313
72 0.11260715126337E-14 22.7127454826322
73 0.11281677964997E-14 22.7127454826316
74 0.11162340034815E-14 22.7127454826318
75 0.11208709203921E-14 22.7127454826310
Benchmark completed
VERIFICATION SUCCESSFUL
Zeta is 0.2271274548263E+02
Error is 0.3128387698896E-15
CG Benchmark Completed.
Class = B
Size = 75000
Iterations = 75
Time in seconds = 88.72
Total processes = 1
Compiled procs = 1
Mop/s total = 616.64
Mop/s/process = 616.64
Operation type = floating point
Verification = SUCCESSFUL
Version = 3.3
Compile date = 25 Dec 2014
Compile options:
MPIF77 = mpif77
FLINK = $(MPIF77)
FMPI_LIB = -L/usr/lib/openmpi/lib -lmpi -lopen-rte -lo...
FMPI_INC = -I/usr/lib/openmpi/include -I/usr/lib/openm...
FFLAGS = -O
FLINKFLAGS = -O
RAND = randi8
Please send the results of this run to:
NPB Development Team
Internet: npb#nas.nasa.gov
If email is not available, send this to:
MS T27A-1
NASA Ames Research Center
Moffett Field, CA 94035-1000
Fax: 650-604-3957
Finish in 16:46:46--25/12/2014
-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-
-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-
Start in 17:03:13--25/12/2014
NAS Parallel Benchmarks 3.3 -- CG Benchmark
Size: 75000
Iterations: 75
Number of active processes: 1
Number of nonzeroes per row: 13
Eigenvalue shift: .600E+02
iteration ||r|| zeta
1 0.30354859861452E-12 59.9994751578754
2 0.11186435488267E-14 21.7627846142536
3 0.11312258511928E-14 22.2876617043224
4 0.11222160585284E-14 22.5230738188346
5 0.11244234177219E-14 22.6275390653892
6 0.11330434819384E-14 22.6740259189533
7 0.11334259623050E-14 22.6949056826251
8 0.11374839313647E-14 22.7044023166872
9 0.11424877443039E-14 22.7087834345620
10 0.11329475190566E-14 22.7108351397177
11 0.11337364093482E-14 22.7118107121341
12 0.11379928308864E-14 22.7122816240971
13 0.11369453681794E-14 22.7125122663243
14 0.11430390337015E-14 22.7126268007594
15 0.11400318886400E-14 22.7126844161819
16 0.11352091331197E-14 22.7127137461755
17 0.11350923439124E-14 22.7127288402000
18 0.11475378864565E-14 22.7127366848296
19 0.11366777929028E-14 22.7127407981217
20 0.11274243312504E-14 22.7127429721364
21 0.11353930792856E-14 22.7127441294025
22 0.11299685800278E-14 22.7127447493900
23 0.11296405041170E-14 22.7127450834533
24 0.11381975597887E-14 22.7127452643881
25 0.11328127301663E-14 22.7127453628451
26 0.11367332658939E-14 22.7127454166517
27 0.11283372178605E-14 22.7127454461696
28 0.11384734158863E-14 22.7127454624211
29 0.11394011989719E-14 22.7127454713974
30 0.11354294067640E-14 22.7127454763703
31 0.11412988029103E-14 22.7127454791343
32 0.11358088407717E-14 22.7127454806740
33 0.11263266152515E-14 22.7127454815316
34 0.11275183080286E-14 22.7127454820131
35 0.11328306951409E-14 22.7127454822840
36 0.11357880314891E-14 22.7127454824349
37 0.11332687790488E-14 22.7127454825202
38 0.11324108818137E-14 22.7127454825684
39 0.11365065523777E-14 22.7127454825967
40 0.11361185361321E-14 22.7127454826116
41 0.11276519820716E-14 22.7127454826202
42 0.11317183424878E-14 22.7127454826253
43 0.11236007481770E-14 22.7127454826276
44 0.11304065564684E-14 22.7127454826296
45 0.11287791356431E-14 22.7127454826310
46 0.11297028000133E-14 22.7127454826310
47 0.11281236869666E-14 22.7127454826314
48 0.11277254075548E-14 22.7127454826317
49 0.11320327289847E-14 22.7127454826309
50 0.11287655285563E-14 22.7127454826321
51 0.11230503422400E-14 22.7127454826324
52 0.11292089094944E-14 22.7127454826313
53 0.11366728396408E-14 22.7127454826315
54 0.11222618466968E-14 22.7127454826310
55 0.11278193276516E-14 22.7127454826315
56 0.11244624896030E-14 22.7127454826316
57 0.11264508872685E-14 22.7127454826318
58 0.11255583774760E-14 22.7127454826314
59 0.11227129146723E-14 22.7127454826314
60 0.11189480800173E-14 22.7127454826318
61 0.11163241472678E-14 22.7127454826315
62 0.11278839424218E-14 22.7127454826318
63 0.11226804133008E-14 22.7127454826313
64 0.11222456601361E-14 22.7127454826317
65 0.11270879524310E-14 22.7127454826308
66 0.11303771390006E-14 22.7127454826319
67 0.11240101357287E-14 22.7127454826319
68 0.11240278884391E-14 22.7127454826321
69 0.11207748067718E-14 22.7127454826317
70 0.11178755187571E-14 22.7127454826327
71 0.11195935245649E-14 22.7127454826313
72 0.11260715126337E-14 22.7127454826322
73 0.11281677964997E-14 22.7127454826316
74 0.11162340034815E-14 22.7127454826318
75 0.11208709203921E-14 22.7127454826310
Benchmark completed
VERIFICATION SUCCESSFUL
Zeta is 0.2271274548263E+02
Error is 0.3128387698896E-15
CG Benchmark Completed.
Class = B
Size = 75000
Iterations = 75
Time in seconds = 87.47
Total processes = 1
Compiled procs = 1
Mop/s total = 625.43
Mop/s/process = 625.43
Operation type = floating point
Verification = SUCCESSFUL
Version = 3.3
Compile date = 25 Dec 2014
Compile options:
MPIF77 = mpif77
FLINK = $(MPIF77)
FMPI_LIB = -L/usr/lib/openmpi/lib -lmpi -lopen-rte -lo...
FMPI_INC = -I/usr/lib/openmpi/include -I/usr/lib/openm...
FFLAGS = -O
FLINKFLAGS = -O
RAND = randi8
Please send the results of this run to:
NPB Development Team
Internet: npb#nas.nasa.gov
If email is not available, send this to:
MS T27A-1
NASA Ames Research Center
Moffett Field, CA 94035-1000
Fax: 650-604-3957
Finish in 17:04:43--25/12/2014
-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-
tavg.dat
88.3055
45.1482
37.7202
37.4035
53.777
speedup.dat
1
1.9559
2.34107
2.36089
1.64207
You can do it all in one awk script that processes all the log files:
#!/usr/bin/awk -f
BEGIN { FS="=" }
lfname != FILENAME { lfname = FILENAME; split(FILENAME, a, "."); fnum=a[3] }
/Time in seconds/ { tsecs[fnum] += $2; tcnt[fnum]++ }
/Total processes/ { cp[fnum] = int($2) }
END {
tavg1 = tsecs[1]/tcnt[1]
for( k in tsecs ) {
tavgk = tsecs[k]/tcnt[k]
if( tavgk > 0 ) {
print k OFS cp[k] OFS tavgk OFS tavg1/tavgk
}
}
}
If you put that in a file called awk.script and make it executable with chmod +x awk.script you can run it in bash like:
./awk.script cg.B.*.log
If you're using GNU awk, the output will be ordered (other awk flavors may need extra steps to ensure ordered output).
After generating a 2nd and 3rd file, I get output like:
1 1 88.095 1
2 2 68.095 1.29371
3 4 49.595 1.77629
where the unnamed columns are like: file number, # processes, avg per file, speedup. You could get just the speedups by changing the print in the END block to be like print tavg1/tavgk.
Here's a breakdown of the script:
Use a simpler field separator in BEGIN
lfname != FILENAME - parse out file number from the filename as fnum but only when the FILENAME changes.
/Time in seconds/ - store the values in tsecs and tcnt arrays with an fnum key.
/Total processes/ - store the process count in the cp array with an fnum key. Use the int() function to strip whitespace from the value.
END - Calculate the average for fnum 1 as tavg1, loop through the keys in tsecs and calculate the average by fnum key as tavgk. When tavgk > 0 print the output as described above.
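For cross-checking the numbers, the averaging and speedup steps can also be sketched in Python. This is a rough equivalent rather than a replacement for the awk script; `TIME_LINE`, `average_time` and `speedups` are names invented for this example, and the log text is passed in as a string so the sketch does not depend on any file being present.

```python
import re

# Matches lines such as "Time in seconds = 88.72", with any spacing around "=".
TIME_LINE = re.compile(r"Time in seconds\s*=\s*([0-9.]+)")

def average_time(log_text):
    """Average every 'Time in seconds = ...' value found in one log file's text."""
    times = [float(match.group(1)) for match in TIME_LINE.finditer(log_text)]
    return sum(times) / len(times)

def speedups(averages):
    """Given {processes: average time}, return {processes: T(1) / T(p)}."""
    t1 = averages[1]
    return {p: t1 / tp for p, tp in sorted(averages.items())}

# The two timings shown in the cg.B.1.log excerpt of the question.
log = "Time in seconds = 88.72\nTotal processes = 1\nTime in seconds = 87.47\n"
print(average_time(log))
# The first two averages listed in tavg.dat.
print(speedups({1: 88.3055, 2: 45.1482}))
```

In practice each log file would be read with `open(...).read()` and its average stored under the corresponding process count.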
You have figured out all the difficult parts already. You don't need the tavg.dat file at all. Create your tavg(n) function directly as a system call:
tavg(n) = system("awk 'BEGIN { FS = \"[ \\t]*=[ \\t]*\" } \
/Time in seconds/ { s += $2; c++ } /Total processes/ { \
if (! CP) CP = $2 } END { print s/c }' cg.B.".n.".log")
And a speedup(n) function, following S(p) = T(1)/T(p), as
speedup(n)=tavg(1)/tavg(n)
Now you can set print to write to a file:
set print "speedup.dat"
do for [i=1:5] {
print speedup(i)
}
unset print

reading in a text file with a SUB (1a) (Control-Z) character in R on Windows

Following on from my query last week reading badly formed csv in R - mismatched quotes, these same CSV files also have embedded control characters such as the ASCII Substitute Character which is decimal 26 or 0x1A. Unfortunately readLines() seems to truncate the line at this character, so I am having difficulty in matching quotes - apart from losing the later fields in these lines!
I have tried to readBin() but I can't get it to read this file. I'm afraid I can't cleanly read this into R to give you an example and I'm having difficulty in creating these in R. Sorry not to be able to demonstrate with a clean example. Thoughts?
Update
Now I'm confused - when I use the code
h3 <- paste('1,34,44.4,"', rawToChar(as.raw(c(as.integer(k1), 26, 65))), '",99')
identical(readLines(textConnection(h3)), h3)
I get TRUE which I find quite surprising!
Update 2
h3
[1] "1,34,44.4,\" HIJK\032A \",99"
> writeLines(h3, 'h3.txt')
> h3a <- readLines('h3.txt')
Warning message:
In readLines("h3.txt") : incomplete final line found on 'h3.txt'
> h3a
[1] "1,34,44.4,\" HIJK"
So readLines() reacts differently when coming from a textConnection() and it silently truncates at the SUB character.
I would be surprised if it makes a difference but I'm on 2.15.2 on Windows-64.
Update 3
Some vague success in solving this...
zb <- file('h3.txt', "rb")
tmp <- readBin(zb, raw(), size=1, n=400) # raw is always of size =1
nchar(tmp)
# [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
close(zb)
tmp
# [1] 31 2c 33 34 2c 34 34 2e 34 2c 22 20 48 49 4a 4b 1a 41 20 22 2c 39 39 0d 0a
rawToChar(tmp)
# [1] "1,34,44.4,\" HIJK\032A \",99\r\n"
i.e. if I read in the file as binary and convert to character() afterwards it seems to work... this will be tedious for large CSV files...
Could there be a bug in R in incorrectly detecting a Control-Z as end of file on windows??
I think I've figured out a solution - because there appears to be a problem reading a Control-Z in the middle of a file on Windows, we need to read the file in binary / raw mode.
fnam <- 'h3.txt'
tmp.bin <- readBin(fnam, raw(), size=1, n=max(2*file.info(fnam)$size, 100))
tmp.char <- rawToChar(tmp.bin)
txt <- unlist(strsplit(tmp.char, '\r\n', fixed=TRUE))
txt
[1] "1,34,44.4,\" HIJK\032A \",99"
Update
The following better answer was posted by Duncan Murdoch to R-Devel. Converting it into a function, I get:
sReadLines <- function(fnam) {
    f <- file(fnam, "rb")
    res <- readLines(f)
    close(f)
    res
}
I also ran into this problem when I used read.csv with a csv file that contained the SUB or CTRL-Z in the middle of the file.
Solved it with the readr package (if your file is comma separated)
library(readr)
read_csv("h3.txt")
If you have a ; as a separator, then use:
library(readr)
read_csv2("h3.txt")
