Pig Scripting - Cast STRING to INT - hadoop

Beginner in Pig, Need help
For all NON - AlphaNumeric, Cast the STRING TO INT
- To be handled without passing each field name separately.
Sample data -
00013425731998101620140402300032736901 00000000AAA001200X111685V00000000
00283335542006120920131010300030003105 00000000AAA001200X117407 00000000
00000000331998101620140402300033128107 00000000AAA001200X111685 00000000
00003902331999090620140402300032545208 00000000AAA001200X111685 00000000
Its a fixedwidth file, mapping details as follow -
orderNumber 1 9
origin 10 10
Startdate 11 18
ModDate 19 26
Identifier 27 36
Code 37 38
CodeType 39 40
Number 41 48
Num 49 114

Either use substr to extract the parts and then cast them or use a regexp. For example for the first two fields:
input = load ... as (line:chararray);
a = foreach input generate SUBSTRING(line, 0, 9) as orderNumber:long, SUBSTRING(line, 9, 10) as origin:chararray;
This way you should be able to convert each part of the input line into the desired components.
Alternatively you could write a UDF that takes a string as input and does the splitting and returns a bag or tuple.

Related

SNMP - Decode Hex String Value

This is my first question here, so hope it's correctly done.
Im trying to get some information from a ZTE C300 OLT.
The thing is when i try to get the SN of one of the ONTS I get the response in HEX-String
snmpwalk -cpublic -v2c [OLTIP] 1.3.6.1.4.1.3902.1082.500.10.2.2.5.1.2
And this is the response that I get
SNMPv2-SMI::enterprises.3902.1082.500.10.2.2.5.1.2.285278736.1 = Hex-STRING: 5A 54 45 47 C8 79 9B 27
This is the SN that i have on the OLT ZTEGC8799B27, but im trying to convert the HEX-STRING into text and i don't get that SN text.
Indeed i have a python script for SNMP and the response that i get for that OID is
{'1.3.6.1.4.1.3902.1082.500.10.2.2.5.1.2.285278736.1': "ZTEGÈy\x9b'"}
Can someone give me a hand on this?. I'm new on SNMP and this is giving me some headache.
Thanks in advace!
This is a 8 octet hex string, the first 4 octets are ASCII.
Just convert hex 2 ascii.
Indeed it was easier. The firts 4 bytes were encoded, and the other 4 is the actual serial number splitted every 2 digits. So i only need to decode the first part and concatenate the rest.
Works with OLT ZTE C320
def hex_str(str):
str = str.strip()
str = str.split(' ')
vendor_id = ''
serial = str[4:]
serial = "".join(serial)
for hex_byte in str[:4]:
vendor_id += chr(int(hex_byte, 16))
normalized_serial = vendor_id + serial
return normalized_serial
def ascii_to_hex(str):
arr = []
hex_byte = ''
for i in range(len(str)):
hex_byte += hex(ord(str[i]))
hex_byte = hex_byte.replace('0x', ' ')
hex_byte = hex_str(hex_byte)
return hex_byte
# value = f"5A 54 45 47 C8 79 9B 27 "
# value = f"49 54 42 53 8B 69 A2 45 "
# value = f"ZTEGÈy\x9b'"
value = f"ITBS2Lz/"
# value = f"ITBS2HP#"
if(len(value) == 24):
print(hex_str(value))
else:
print(ascii_to_hex(value))

removing bad data from a data file using pig

I have a data file like this
1943 49 1
1975 91 L
1903 56 3
1909 52 3
1953 96 3
1912 82
1976 66 3
1913 35
1990 45 1
1927 92 A
1912 2
1924 22
1971 2
1959 94 E
now using pig script I want to remove the bad data like removing those rows which have characters and empty fields
I tried this way
records = load '/user/a106524609/test.txt' using PigStorage(' ') as
(year:chararray, temperature:int, quality:int);
rec1 = filter records by temperature != 'null' and (quality != 'null ')
Load it as lines
A = load 'data.txt' using PigStorage('\n') as (line:chararray);
Split on all whitespaces
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\s+')) as (year:int,temp:int,quality:chararray);
Filter by valid strings
C = FILTER B BY quality IN ('0','1','2','3','4','5','6','7','8','9');
(Optionally) Cast to an int
D = FOREACH C GENERATE year,temp,(int)quality;
In Spark, I would start with a regex match of the expected format.
val cleanRows = sc.textFile("data.txt")
.filter(line => line.matches("(?:\\d+\\s+){2}\\d+"))

Different Code 128 barcode symbols representing the same data

I'm currently using software called LineView. It generates downtime reason codes for our factory lines. An operator scans the barcodes with an RS232 scanner and it goes into our XL board system.
The software itself generates the barcodes within an internet browser, but I am trying to make it so our own labeling machine can also print out the barcodes. However, the barcodes that are produced by the labeler (and the many online barcode generators I've tried) look longer and do not work.
The data for the example 128 barcode that I am trying to replicate is [SOH]1[STX]65;1067[ETX].
According to the manual:
- The Start of Header character (ASCII 0x01) starts the XL Command packet.
1 - The Serial Address of the XL device (the default is 1).
- The Start of Transmission character (ASCII 0x02) marks the start of the actual command.
65; - The ID of the Production State > Set Reason Code command.
The Reason Code ID (which can range from 1 to 999 for system reasons or 1000 to 1999 for user defined reasons). In my case it is 1067
- The End of Transmission character (ASCII 0x03) ends the XL Command packet.
I have attatched the pictures of what LineView produces (which is what I want it to look like) and what it is currently printing like on our labeller.
When I scan them they both come up with the [SOH]1[STX]65;1067[ETX] code despite them looking different.
Any help with this would be very much appreciated.
Your intended barcode is constructed internally using the following series of Code 128 codewords which correctly represent the ASCII control characters:
103 Start-in-Mode-A (Upper-case and control characters)
65 [SOH] (ASCII 1)
17 1
66 [STX] (ASCII 2)
22 6
21 5
27 ;
99 Switch-to-Mode-C (Double-density numeric)
10 10
67 67
101 Switch-to-Mode-A
67 [ETX] (ASCII 3)
67 Check-digit
106 Stop
Your label printer is printing a barcode representing the literal string [SOH]1[STX]65;1067[ETX] with no ASCII control characters (i.e. left-bracket, S, O, H, right-bracket, ...) using the following internal codewords:
104 Start-in-Mode-B (Mixed-case)
59 [
51 S
47 O
40 H
61 ]
17 1
59 [
51 S
52 T
56 X
61 ]
22 6
21 5
27 ;
99 Switch-to-Mode-C (Double-density numeric)
10 10
67 67
100 Switch-to-Mode-B
59 [
37 E
52 T
56 X
61 ]
57 Check-digit
106 Stop
So you need to work out how to correctly specify ASCII control characters in the input to your labelling machine.

Arcgis field calculations, get last dits of number

I want get 2 last digits of numbers, this digits also can be string not int
Cell:
29501864,071879
17906796,472795
17038547,962973
182638877,306748
101159098,3431
183391558,187717
VB script function: Right(CStr( [tracts.Shape_Area] ),2)
What I get:
94
47
32
48
31
17
What I want get
79
95
73
48
31
17
What is wrong in my function ?
This code is correct:
Right(CStr( [Shape_Area] ),2)
When you import a FeatureClass in a geodatabase, the Shape_Area created as a Double field that can have many decimal places! (By default 8 decimal places)
But in Attribute Table, you can see 6 decimal places by default. To see all the decimal places of your numbers try this:
CStr( [Shape_Area] )

reading in a text file with a SUB (1a) (Control-Z) character in R on Windows

Following on from my query last week reading badly formed csv in R - mismatched quotes, these same CSV files also have embedded control characters such as the ASCII Substitute Character which is decimal 26 or 0x1A. Unfortunately readLines() seems to truncate the line at this character, so I am having difficulty in matching quotes - apart from losing the later fields in these lines!
I have tried to readBin() but I can't get it to read this file. I'm afraid I can't cleanly read this into R to give you an example and I'm having difficulty in creating these in R. Sorry not to be able to demonstrate with a clean example. Thoughts?
Update
Now I'm confused - when I use the code
h3 <- paste('1,34,44.4,"', rawToChar(as.raw(c(as.integer(k1), 26, 65))), '",99')
identical(readLines(textConnection(h3)), h3)
I get TRUE which I find quite surprising!
Update 2
h3
[1] "1,34,44.4,\" HIJK\032A \",99"
> writeLines(h3, 'h3.txt')
> h3a <- readLines('h3.txt')
Warning message:
In readLines("h3.txt") : incomplete final line found on 'h3.txt'
> h3a
[1] "1,34,44.4,\" HIJK"
So readLines() reacts differently when coming from a textConnection() and it silently truncates at the SUB character.
I would be surprised if it makes a difference but I'm on 2.15.2 on Windows-64.
Update 3
Some vague success in solving this...
zb <- file('h3.txt', "rb")
tmp <- readBin(zb, raw(), size=1, n=400) # raw is always of size =1
nchar(tmp)
# [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
close(zb)
tmp
# [1] 31 2c 33 34 2c 34 34 2e 34 2c 22 20 48 49 4a 4b 1a 41 20 22 2c 39 39 0d 0a
rawToChar(tmp)
# [1] "1,34,44.4,\" HIJK\032A \",99\r\n"
i.e. if I read in the file as binary and convert to character() afterwards it seems to work... this will be tedious for large CSV files...
Could there be a bug in R in incorrectly detecting a Control-Z as end of file on windows??
I think I've figured out a solution - because there appears to be a problem reading a Control-Z in the middle of a file on Windows, we need to read the file in binary / raw mode.
fnam <- 'h3.txt'
tmp.bin <- readBin(fnam, raw(), size=1, n=max(2*file.info(dfnam)$size, 100))=1
tmp.char <- rawToChar(tmp.bin)
txt <- unlist(strsplit(tmp.char, '\r\n', fixed=TRUE))
txt
[1] "1,34,44.4,\" HIJK\032A \",99"
Update
The following better answer was posted by Duncan Murdoch to R-Devel refer. Converting it into a function I get:
sReadLines <- function(fnam) {
f <- file(fnam, "rb")
res <- readLines(f)
close(f)
res
}
I also ran into this problem when I used read.csv with a csv file that contained the SUB or CTRL-Z in the middle of the file.
Solved it with the readr package (if your file is comma separated)
library(readr)
read_csv("h3.txt")
If you have a ; as a separator, then use:
library(readr)
read_csv2("h3.txt")

Resources