Apache PIG - How to cut digits after decimal point - hadoop

Is there any way to cut off a certain number of digits after the decimal point of a float or double number?
For example: given 2.67894, I want to get 2.6 as the result (and not 2.7, as rounding would give).

Try this, where val holds your values (e.g. 2.666, 3.666, 4.666666, 5.3456334, ...):
b = foreach a GENERATE (FLOOR(val * 10) / 10);
dump b;
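The factor of 10 controls how many decimal places are kept, and FLOOR simply drops the rest, so positive values are never rounded up. A minimal, self-contained sketch of the same idea for two decimal places (the file name and schema are placeholders):
-- one double value per line (hypothetical input file)
a = LOAD 'values.txt' AS (val:double);
-- scale by 100 instead of 10 to keep two decimal places
b = FOREACH a GENERATE (FLOOR(val * 100) / 100) AS truncated;
DUMP b;
-- 2.67894 becomes 2.67 (not 2.68)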

Write a UDF (User Defined Function) for this.
A very simple Python UDF (numformat.py):
@outputSchema('value:double')
def format(data):
    return round(data, 1)
(Of course, you can parametrize the UDF to use a different precision.)
Then register and use it in your Pig code. Example:
REGISTER numformat.py USING jython as numformat;
A = LOAD 'so/testdata.csv' USING PigStorage(',') AS (data:double);
B = FOREACH A GENERATE numformat.format(data);
DUMP B;
For the following input:
2.1234
12.334
The dumped result is:
(2.1)
(12.3)

Related

How do I parallelize a for-loop in Octave using pararrayfun (any other function will also do)?

Well, I'm new to Octave and I wanted to know how to implement parallel execution of a for loop in Octave.
I'm looking for a parallel implementation of the code below (it's not the exact code that I'm trying to execute, but something similar to this):
% read a csv file
master_sheet = csv2cell('master_sheet.csv');
delta = 0.001;
nprocs = nproc();
% extract some values from the csv file and store them in variables
a = master_sheet{34,2};
b = master_sheet{38,2};
c = master_sheet{39,2};
for i = 0:1000
  % create variants of a, b and c by adding a delta value
  a_adj = a + i*delta;
  b_adj = b + i*delta;
  c_adj = c + i*delta;
  % club all the above variables into an array variable
  array_abc = [a_adj, b_adj, c_adj];
  % send this array as an argument/parameter to processingData(), which
  % performs a series of calculations and writes the results to a file
  processingData(array_abc);
endfor
Currently, I'm using the parallel package (pararrayfun) to implement this, but if there is any other way (package) to parallelize a for loop in Octave, I'm open to exploring that as well.
Thank you!

PIG Script to split large txt file into parts based on specified word

I am trying to build a Pig script that takes in a textbook file, divides it into chapters, then compares the words in each chapter and returns only the words that show up in all chapters, along with their counts. The chapters are delimited fairly easily by CHAPTER - X.
Here's what I have so far:
lines = LOAD '../../Alice.txt' AS (line:chararray);
lineswithoutspecchars = FOREACH lines GENERATE REPLACE(line,'([^a-zA-Z\\s]+)','') as line;
words = FOREACH lineswithoutspecchars GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
Sorry that this question is probably way too simple compared to what I normally ask on Stack Overflow; I googled around for it, but perhaps I am not using the correct keywords. I am brand new to Pig and trying to learn it for a new job assignment.
Thanks in advance!
A bit lengthy, but you will get the result. You could cut down unnecessary relations based on your file, though. Appropriate comments are provided in the script.
Input File:
Pig does not know whether integer values in baseball are stored as ASCII strings, Java
serialized values, binary-coded decimal, or some other format. So it asks the load func-
tion, because it is that function’s responsibility to cast bytearrays to other types. In
general this works nicely, but it does lead to a few corner cases where Pig does not know
how to cast a bytearray. In particular, if a UDF returns a bytearray, Pig will not know
how to perform casts on it because that bytearray is not generated by a load function.
CHAPTER - X
In a strongly typed computer language (e.g., Java), the user must declare up front the
type for all variables. In weakly typed languages (e.g., Perl), variables can take on values
of different type and adapt as the occasion demands.
CHAPTER - X
In this example, remember we are pretending that the values for base_on_balls and
ibbs turn out to be represented as integers internally (that is, the load function con-
structed them as integers). If Pig were weakly typed, the output of unintended would
be records with one field typed as an integer. As it is, Pig will output records with one
field typed as a double. Pig will make a guess and then do its best to massage the data
into the types it guessed.
Pig Script:
A = LOAD 'file' as (line:chararray);
B = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z\\s]+)','') as line;
-- we need to split on CHAPTER X, but the above load gives us one tuple per newline, so
-- group everything and convert that bag to a string, which gives a single tuple with _ as delimiter.
C = GROUP B ALL;
D = FOREACH C GENERATE BagToString(B) as (line:chararray);
-- now we don't have any commas, so convert our delimiter CHAPTER X to a comma. We do this because if we pass this
-- to TOKENIZE it will split into separate columns, which is useful for RANK.
E = FOREACH D GENERATE REPLACE(line,'_CHAPTER X_',',') AS (line:chararray);
F = FOREACH E GENERATE REPLACE(line,'_',' ') AS (line:chararray); -- remove the delimiter created by BagToString
-- create separate columns
G = FOREACH F GENERATE FLATTEN(TOKENIZE(line,',')) AS (line:chararray);
-- we need to rank each chapter so that counting each word per chapter is easy.
H = RANK G;
J = FOREACH H GENERATE rank_G,FLATTEN(TOKENIZE(line)) as (line:chararray);
J1 = GROUP J BY (rank_G, line);
J2 = FOREACH J1 GENERATE COUNT(J) AS (cnt:long),FLATTEN(group.line) as (word:chararray),FLATTEN(group.rank_G) as (rnk:long);
-- So the J2 result will not have duplicate words within each chapter now.
-- So if we group it by word and then filter for a count greater than 2 (there are 3 chapters), we are sure that the word is present in all chapters.
J3 = GROUP J2 BY word;
J4 = FOREACH J3 GENERATE SUM(J2.cnt) AS (sumval:long),COUNT(J2) as (cnt:long),FLATTEN(group) as (word:chararray);
J5 = FILTER J4 BY cnt > 2;
J6 = FOREACH J5 GENERATE word,sumval;
dump J6;
-- result in the order word, count across chapters
Output:
(a,8)
(In,5)
(as,6)
(the,9)
(values,4)

Convert date with milliseconds using PIG

Really stuck on this! Assume I have the following data set:
A | B
------------------
1/2/12 | 13:3.8
04:4.1 | 12:1.4
15:4.3 | 1/3/13
Observations A and B are generally in the format minutes:seconds.milliseconds, where A is a click and B is a response. Sometimes the time takes the form month/day/year if either event happens at the beginning of a new day.
What I want is to calculate the average difference between B and A. I can easily handle m:s.ms by splitting each of A and B into two parts, casting them as DOUBLE and performing all the needed operations, but it all fails when m/d/yy values are introduced. The easiest way would be to omit them, but that is not really good practice. Is there a clean way to handle such exceptions using Pig?
A thought worth contemplating ....
Ref : http://pig.apache.org/docs/r0.12.0/func.html for String and Date functions used.
Input :
1/2/12|13:3.8
04:4.1|12:1.4
15:4.3|1/3/13
Pig Script :
A = LOAD 'input.csv' USING PigStorage('|') AS (start_time:chararray,end_time:chararray);
B = FOREACH A GENERATE (INDEXOF(end_time,'/',0) > 0 AND LAST_INDEX_OF(end_time,'/') > 0 AND (INDEXOF(end_time,'/',0) != LAST_INDEX_OF(end_time,'/'))
? (ToUnixTime(ToDate(end_time,'MM/dd/yy'))) : (ToUnixTime(ToDate(end_time,'mm:ss.S')))) -
(INDEXOF(start_time,'/',0) >0 AND LAST_INDEX_OF(start_time,'/') > 0 AND (INDEXOF(start_time,'/',0) != LAST_INDEX_OF(start_time,'/'))
? (ToUnixTime(ToDate(start_time,'MM/dd/yy'))) : (ToUnixTime(ToDate(start_time,'mm:ss.S')))) AS diff_time;
C = FOREACH (GROUP B ALL) GENERATE AVG(B.diff_time);
DUMP C;
N.B. In place of ToUnixTime we can use the ToMilliSeconds() method.
Output :
(1.0569718666666666E7)
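A rough sketch of that substitution for one of the two columns (assuming the same input, schema and formats as above; untested):
A = LOAD 'input.csv' USING PigStorage('|') AS (start_time:chararray,end_time:chararray);
-- ToMilliSeconds keeps sub-second precision, unlike ToUnixTime which returns whole seconds
B2 = FOREACH A GENERATE (INDEXOF(end_time,'/',0) > 0 AND LAST_INDEX_OF(end_time,'/') > 0 AND (INDEXOF(end_time,'/',0) != LAST_INDEX_OF(end_time,'/'))
? (ToMilliSeconds(ToDate(end_time,'MM/dd/yy'))) : (ToMilliSeconds(ToDate(end_time,'mm:ss.S')))) AS end_ms;
DUMP B2;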

Reverse the group data as a different record using Pig

Split the group record into different records:
For example:
Input: (A,(3,2,3))
Output, as 3 new lines:
A,3
A,2
A,3
Can anyone let me know how to do this, please?
The problem is that when you convert the ArrayList output to a tuple, it is difficult to achieve what you want, so I recommend this approach; it makes it easy to get the output.
In your UDF code, instead of creating an ArrayList, append the output into a comma-separated string and return it back to the Pig script.
Your final output from the UDF should be a string like "3,2,3".
Then use the code below to get the result:
C = FOREACH B GENERATE $0,NewRollingCount(BagToString($1)) AS rollingCnt;
D = FOREACH C GENERATE $0,FLATTEN(TOKENIZE(rollingCnt));
DUMP D;
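As a side note, if the bag in B already holds one single-field tuple per value (e.g. (A,{(3),(2),(3)})), flattening it directly also gives one record per element without any UDF. A sketch under that assumption:
-- hypothetical schema for B: (key:chararray, vals:{t:(v:int)})
E = FOREACH B GENERATE $0, FLATTEN($1);
DUMP E;
-- (A,3)
-- (A,2)
-- (A,3)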

Apache Pig: Convert bag of tuples to a single tuple

I'm trying to convert a bag of tuples into a single tuple:
grunt> describe B;
B: {Comment: {tuple_of_tokens: (token: chararray)}}
grunt> dump B;
({(10),(123),(1234)})
I would like to get (10,123,1234) from B. I've tried using FLATTEN, but this gives a new line for each tuple in the bag, and that is not what I want.
Is there any way to do this conversion without writing a UDF?
Thanks in advance!
The BagToTuple() function is already available as a built-in from Pig 0.11.0 onwards (see the reference below); just make sure pig-0.11.0.jar (or newer) is on your classpath. You do not need to write any UDF code for this.
Download jar from this link:
http://www.java2s.com/Code/Jar/p/Downloadpig0110jar.htm
Reference:
https://pig.apache.org/docs/r0.12.0/api/org/apache/pig/builtin/BagToTuple.html
Example:
input.txt
{(10),(123),(1234)}
{(4),(5)}
Pigscript:
A= LOAD 'input.txt' USING PigStorage() AS (b:{t:(f1)});
B = FOREACH A GENERATE FLATTEN(BagToTuple(b));
DUMP B;
Output:
(10,123,1234)
(4,5)
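Applied to the relation B from the question (a sketch assuming the schema shown by describe B above), that would be:
-- Comment is the bag field of B: {Comment: {tuple_of_tokens: (token: chararray)}}
C = FOREACH B GENERATE FLATTEN(BagToTuple(Comment));
DUMP C;
-- (10,123,1234)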
