I am new to Pig. What would be an efficient way to parse data like this? I want to pick out the value after each = sign, for fields such as date, time, devname, etc.
Jun 24 05:25:01 23.45.56.222 date=2014-06-24 time=05:04:43 devname=XX-FGT-Primary
device_id=FG3K8A3408600390 log_id=0021000002 type=traffic subtype=allowed pri=notice
vd=XX-Internet src=23.83.57.99 src_port=7569 src_int="amc-sw1/2" dst=23.91.19.16
dst_port=343 dst_int="amc-sw1/1" SN=116445695565 status=accept policyid=2272
dst_country="India" src_country="India" dir_disp=org tran_disp=noop service=HTTPS
proto=6 duration=122 sent=124 rcvd=84 sent_pkt=3 rcvd_pkt=2
Any code snippets would really help.
I think you are looking for the UDF called REGEX_EXTRACT_ALL.
For a code snippet, look here.
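To illustrate the regex idea outside of Pig, here is a minimal Python sketch that pulls every key=value pair out of a line like the one in the question. The names PAIR_RE and parse_kv are made up for this example, and the pattern assumes that values are either double-quoted or contain no spaces; it is not the Pig REGEX_EXTRACT_ALL call itself.

```python
import re

# A value is either a double-quoted string or a run of non-space characters.
# (Assumption: this covers the log format shown in the question.)
PAIR_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_kv(line):
    """Return a dict of every key=value pair found in a log line."""
    return {key: value.strip('"') for key, value in PAIR_RE.findall(line)}

sample = ('Jun 24 05:25:01 23.45.56.222 date=2014-06-24 time=05:04:43 '
          'devname=XX-FGT-Primary src_int="amc-sw1/2" dst_country="India" proto=6')
fields = parse_kv(sample)
print(fields["date"], fields["time"], fields["devname"])
```

The leading syslog prefix contains no = sign, so it is simply skipped.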
Let's assume that my template looks like the following:
string1=${obj.firstString}
string2=${obj.secondString}
number1=${obj.firstNumber}
I'm looking for some automatic way to wrap all my string parameters in single quotes. The expected output is:
string1='A'
string2='B'
number1=42
I understand that I can write string1=${"'" + obj.firstString + "'"} , but maybe there is some more conventional way for this requirement...
Thanks a lot!
I would just do this:
string1='${obj.firstString}'
string2='${obj.secondString}'
number1=${obj.firstNumber}
It's a template language, so the basic idea is to make your program look similar to its own output.
I see requests to socket.io containing a parameter t with values like LZywzeV, LZz5lk7 and similar.
All examples that I found so far used second- or millisecond-based UNIX timestamps.
Has anyone ever seen a timestamp format like this? (It is not base64-encoded).
I started looking at a site that uses Socket.io today and ran into the same problem; searching for the protocol definition was useless.
I figured out that this format is something called yeast.
TBH, I really don't know why people invent this sort of thing instead of using something like this pseudocode:
base64(timestamp.getBytes())
A yeast decoding algorithm in Python is as follows:
from datetime import datetime
# yeast alphabet: each character stands for one base-64 digit
a='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_'
b={a[i]: i for i in range(len(a))}
c=0
for d in "LZywzeV":
    c=c*64+b[d]
# c is now the timestamp in milliseconds
print(c)
print(datetime.fromtimestamp(c/1000))
The output of that code is:
1481712065055
2016-12-14 07:41:05
To @jeremoquai:
It is easy; it is just a matter of inverting the algorithm:
def yeast(d):
    r=""
    while d!=0:
        r='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_'[d&63]+r
        d>>=6
    return r
so, if you run
yeast(1481712065055)
it returns LZywzeV
I am trying to create a data field using Mockaroo. They say they support Ruby, but I know nothing about Ruby, so I am trying to find out how to define a field that randomly chooses between the following options:
now() or
now()+days(-1) or
now()+days(-2) or
now()+days(-3)
Idea 1
I was initially thinking something using random like now()+days(this.rand(-3))
Idea 2
I also thought of using or logic like now() or now()+days(-3) or etc...
This answer may end up being different than a typical ruby solution since Mockaroo has their own little API to use too... Appreciate any help that can be given.
Turns out I had to use the random function first and pass in min and max date parameters.
random(now()+days(2), now()+days(-3))
I have a list of the following codes and their corresponding codes in code_128 format. Given a string, I want to be able to generate the corresponding code in CODE_128 format. Based on this list, how could I generate a code_128 number for the string A4Y9387VY34, for example?
code         code in code_128
A4Y9387VY34 ????
ADN38Y644YT7 9611019020018632869509
AXCW99QYTD34 9622019021500078083444
A9YQC44W9J3K 9611083009710754539701
AT8V7T3G3874 9622083021255845940154
A7K444N4FKB8 9622083033510467186874
AYCHFW448HTQ 9611005019246067403120
AY63CWBMTDCC 9622005028182439033426
ANY7TF46NGQ3 9622005031345848081170
AYY48TBVQ3FH 9611200003793988696055
AT8Q4CF4DQ9Q 9611200021606968867090
A764WYQFJWTT 9622200022706968919275
AC649ND7N8B6 9622148007265209832185
A4VDPTJ99YN4 9611148013412173923039
AHDYK498BD6T 9622148021309216149530
A4YYYNY7C3DJ 9611017021934363499071
AYG6XWVCCQ89 9622017031009914238743
A68YJHGQKCCM 9622017031138587166053
APMB7XG9XQC9 9611021011608391750002
AGP8C44Y8VYK 9622021021608111646113
A7C68B9T69XB 9622021021958603678086
AJYYWKR6BDGN 9611010022528724015883
AKMNVXDT9PYN 9622010027475034102229
AXPXMK9QMDFD 9622010031475028243694
I read a lot about it, but I didn't come to any solution. Thanks in advance!!
Well, this is a pretty open question. Here are my suggestions:
If it is a finite list, you can use a hash or a dictionary in which the keys are the codes, each mapped to its corresponding value (in your case, the Code_128 value).
Some scanners have software installed that allows you to change what has been read to a new value, format it, etc.
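The dictionary suggestion can be sketched in Python as follows. The table holds only the first three rows of the list in the question, and the names CODE_128 and lookup are made up for this example. Note that a lookup table can only return values that are already known, so it cannot produce a value for A4Y9387VY34.

```python
# Lookup table built from the first three rows of the list in the question.
CODE_128 = {
    "ADN38Y644YT7": "9611019020018632869509",
    "AXCW99QYTD34": "9622019021500078083444",
    "A9YQC44W9J3K": "9611083009710754539701",
}

def lookup(code):
    """Return the known code_128 value for a code, or None if it is not in the table."""
    return CODE_128.get(code)

print(lookup("ADN38Y644YT7"))
print(lookup("A4Y9387VY34"))  # not in the table, so this prints None
```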
If you need more insight, please give us more detail about the environment you are using.
Hope that helps,
I decided to create a new answer because now I get your point. If you are talking about a GS1-128 code (please see www.gs1.org), please do not start without reading the Wikipedia article about it. As you can see there, it gives a thorough explanation of how to work with that type of code. The code is composed of several application identifiers, each followed by its corresponding value. There is a better way of encoding them, using special characters such as parentheses. Here is other info that may help you.
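As a rough illustration of the "application identifier followed by value" structure, here is a small Python sketch that splits the parenthesized, human-readable form of such a code into its parts. The example string is hypothetical, the names AI_NAMES and parse_gs1 are made up, and the table lists only a handful of common identifiers; whether the 22-digit values in the question follow this layout is not something this sketch can tell you.

```python
import re

# A small, illustrative subset of application identifiers (AIs).
AI_NAMES = {
    "01": "GTIN",
    "10": "batch/lot",
    "17": "expiry date (YYMMDD)",
    "21": "serial number",
}

# "(AI)" followed by everything up to the next opening parenthesis.
AI_RE = re.compile(r'\((\d{2,4})\)([^(]+)')

def parse_gs1(text):
    """Split a parenthesized GS1-128 string into (AI, name, value) triples."""
    return [(ai, AI_NAMES.get(ai, "unknown"), value) for ai, value in AI_RE.findall(text)]

example = "(01)09501101530003(17)251231(10)AB-123"  # hypothetical data
for ai, name, value in parse_gs1(example):
    print(ai, name, value)
```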
Hope it helps,
In Pig, I massaged my data into something like:
(a,{(b,c),(d,e),(f,g)})
(h,{(i,j),(k,l)})
where the first item is the group and the bag holds other values related to the group. I would like to get it into the following format:
(a,b,c,d,e,f,g)
(h,i,j,k,l)
I got to where I am now with
grunt> j = foreach G {
>> o = order myvar by second;
>> generate group, o.(first,second);
>> };
So the tuples in the bag are currently ordered. If I do something like mystuff = foreach j generate group, flatten($1); I get it all flattened and un-grouped.
Is this possible in Pig, and if so, what command should I be looking at?
There is no way that I know of to do what you want out of the box. You really need to use a user-defined function for this. I know it sucks because you have to write Java or Python code, but you'll find several situations where Pig just doesn't go far enough. Pig can be considered a data flow language and not so much a programming language, which is why UDFs play such an important role: they bridge the gap.
My suggestion is you write a UDF that takes in the group and value bag as parameters. Do the ordering/sorting in the UDF and also the flattening.
The other thing you want to be careful about is that now your rows will have different numbers of columns and Pig doesn't really like this. If you are just immediately outputting it, you can probably get away with this. You might want to consider having your UDF write out the list in a tab-delimited string or something that is preformatted. This isn't that big of a deal... feel free to ignore my advice here.
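The core logic of such a UDF could look roughly like this in plain Python. The function name flatten_group is made up, the bag is assumed to arrive as a list of two-field tuples, and the ordering is assumed to be by the second field, as in the question. Registering it with Pig (for example as a Python UDF with an output schema) is not shown here.

```python
def flatten_group(group, bag):
    """Sort the bag's tuples by their second field, then flatten the group
    and all tuple fields into one tab-delimited string."""
    ordered = sorted(bag, key=lambda t: t[1])
    fields = [group]
    for tup in ordered:
        fields.extend(tup)
    return "\t".join(str(f) for f in fields)

print(flatten_group("a", [("d", "e"), ("b", "c"), ("f", "g")]))
print(flatten_group("h", [("i", "j"), ("k", "l")]))
```

Returning a single delimited string sidesteps the problem of rows with different numbers of columns.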