Pig: how to assign names to columns? - hadoop

I have a CSV file with hundreds of columns. When I load the file into Pig, I don't want to name each column, like
A = load 'path/to/file' as (a,b,c,d,e......)
since I'll filter out a lot of them in the second step:
B = foreach A generate $0,$2,....;
But here, can I assign a name and a type to each column of B? Something like:
B = foreach A generate $0,$2,... AS (a:int,b:int,c:float)
I tried the above code but it doesn't work.
Thanks.

You have to specify an alias for each column, between the commas:
B = foreach A generate $0 as a, $2 as b, ...
Note that each column keeps the type it already has.
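If you also need to set the type, a cast can be combined with the alias. A minimal sketch (the positions and types here are just illustrative):
B = foreach A generate (int)$0 as a, (int)$2 as b, (float)$5 as c;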

Related

Iterate on 2 Data Sources in PIG

I have 2 data sources
1) Params.txt which has the following content
item1
item2
item2
.
.
.
itemN
2) Data.txt which has the following content
The names (aliases) of relations A, B, and C are case sensitive.
The names (aliases) of fields f1, f2, and f3 are case sensitive.
Function names PigStorage and COUNT are case sensitive.
Keywords LOAD, USING, AS, GROUP, BY, FOREACH, GENERATE,
and DUMP are case insensitive. They can also be written as load, using, as, group, by, etc.
The task is to check whether each of the N items in the params file exists in each line of the data file.
This is the pseudocode:
FOREACH d IN data:
    FOREACH PARAM IN PARAMS:
        IF PARAM IN d:
            GENERATE PARAM, 1
Is something of this sort possible in Pig scripting? If yes, could you please point me in that direction?
Thanks
This is possible in Pig, but Pig is perhaps an unusual language to solve the problem!
I would approach the problem like this:
Load in Params.txt
Load in Data.txt and tokenise each line (assuming you're happy to split the text on spaces - you might need to think about what to do with punctuation)
Flatten the bag from tokenise to get one "word" per record in the relation.
Join the Params and Data relations. An inner join keeps only the words that appear in both.
Group the data and then count the occurrence of each word.
params = LOAD 'Params.txt' USING PigStorage() AS (param_word:chararray);
data = LOAD 'Data.txt' USING PigStorage() AS (line:chararray);
token_data = FOREACH data GENERATE TOKENIZE(line) AS words:{(word:chararray)};
token_flat = FOREACH token_data GENERATE FLATTEN(words) AS (word);
joined = JOIN params BY param_word, token_flat BY word;
word_count = FOREACH (GROUP joined BY params::param_word) GENERATE
group AS param_word,
COUNT(joined) AS param_word_count;
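Note that the inner join doubles as the membership test here: any token that does not appear in Params.txt drops out before the grouping, so only matching words ever reach the count.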

Pig cross join and replace

I have two files. One file having the data as below
Ram,C,Bnglr
Shyam,A,Kolkata
The other file has reference data:
C,Calicut
A,Ahmedabad
Now, using Pig, I want to search and replace the data in the original file to create a new file from these two files:
Ram,Calicut,Bnglr
Shyam,Ahmedabad,Kolkata
Is this possible in Pig? I know how to do it in MR but want to try it out in Pig.
Yes. Join the files, select the required columns, and write to the new file:
A = LOAD 'file1.txt' AS (a1:chararray,a2:chararray,a3:chararray);
B = LOAD 'file2.txt' AS (b1:chararray,b2:chararray);
C = JOIN A BY a2, B BY b1;
D = FOREACH C GENERATE A::a1,B::b2,A::a3;
STORE D INTO 'file3.txt';
The above logic will work, but if a record in file1 has no matching entry in file2, it will be dropped by the inner join.
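To keep unmatched records from file1, a left outer join can be used instead; a minimal sketch (unmatched rows will carry a null in place of b2):
C = JOIN A BY a2 LEFT OUTER, B BY b1;
D = FOREACH C GENERATE A::a1, B::b2, A::a3;
STORE D INTO 'file3.txt';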

Pig latin join by field

I have a Pig Latin related problem:
I have this data below (in one row):
A = LOAD 'records' AS (f1:chararray, f2:chararray,f3:chararray, f4:chararray,f5:chararray, f6:chararray);
DUMP A;
(FITKA,FINVA,FINVU,FEEVA,FETKA,FINVA)
Now I have another dataset:
B = LOAD 'values' AS (f1:chararray, f2:chararray);
Dump B;
(FINVA,0.454535)
(FITKA,0.124411)
(FEEVA,0.123133)
And I would like to get those two datasets joined: the corresponding value from dataset B placed beside each field of dataset A. So the expected output is below:
FITKA 0.124411, FINVA 0.454535 and so on ...
(They can also be like: FITKA, 0.124411, FINVA, 0.454535 and so on ...)
And then I would be able to multiply the values (0.124411 x 0.454535 ... and so on) because they are on the same row, which is what I want.
Of course I can join column by column, but then the joined values end up at the end of the row and I have to clean them up with another FOREACH ... GENERATE. I want a simpler solution without too many joins, which may cause performance issues.
Dataset A is text (a sentence, in a way).
So what are my options to achieve this?
Any help would be nice.
A sentence can be represented as a tuple containing a bag of (word, count) tuples.
Therefore, I suggest you change the way you store your data to the following format:
sentence:tuple(words:bag{wordcount:tuple(word, count)})
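One way to get close to this without restructuring the input by hand is to build the bag in Pig itself. A hedged sketch (untested; it assumes the six-field schema from the question):
A = LOAD 'records' AS (f1:chararray, f2:chararray, f3:chararray,
                       f4:chararray, f5:chararray, f6:chararray);
-- turn the six fields into one bag, then flatten to one word per row
words  = FOREACH A GENERATE FLATTEN(TOBAG(f1, f2, f3, f4, f5, f6)) AS word;
B      = LOAD 'values' AS (w:chararray, v:double);
-- a single join attaches each word's value
joined = JOIN words BY word, B BY w;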

Hadoop: how to normalize data in multiple columns?

I have a .txt file like this:
1036177 19459.7356 17380.3761 18084.1440
1045709 19674.2457 17694.8674 18700.0120
1140443 19772.0645 17760.0904 19456.7521
where the first column represents the key and the others are the values.
I would like to normalize (min-max) each column and after that sum up the columns.
Can someone give me some advice on how to do that in MapReduce?
From an algorithmic perspective you'll need to:
Mapper
Parse / tokenize each input line by its delimiter (space?)
Use a Text object to encapsulate the key field
Either create a custom value class to encapsulate the other fields or use an ArrayWritable wrapper
Output this Key / Value from your Mapper
Reducer
All values for the same key will be grouped together, so here you'll just need to process each input value and calculate the min, max and sum for each column
Finally output your result
You might want to look at using Apache Pig which should make this task much easier (untested):
grunt> A = LOAD '/path/to/data.txt' USING PigStorage(' ')
AS (key, fld1:float, fld2:float, fld3:float);
grunt> GRP = GROUP A BY key;
grunt> B = FOREACH GRP GENERATE $0, MIN(A.fld1), MAX(A.fld1), SUM(A.fld1),
                                    MIN(A.fld2), MAX(A.fld2), SUM(A.fld2),
                                    MIN(A.fld3), MAX(A.fld3), SUM(A.fld3);
grunt> STORE B INTO '/path/to/output' USING PigStorage('\t', '-schema');
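For the min-max normalization itself, the min and max have to be computed over the whole column rather than per key. A hedged sketch (untested) using GROUP ALL and Pig's scalar projection:
A = LOAD '/path/to/data.txt' USING PigStorage(' ')
        AS (key:chararray, fld1:float, fld2:float, fld3:float);
-- one-row relation holding the global min/max of each column
STATS = FOREACH (GROUP A ALL) GENERATE
        MIN(A.fld1) AS min1, MAX(A.fld1) AS max1,
        MIN(A.fld2) AS min2, MAX(A.fld2) AS max2,
        MIN(A.fld3) AS min3, MAX(A.fld3) AS max3;
-- scale each value to [0,1] and sum the normalized columns per row
NORM = FOREACH A GENERATE key,
        ((fld1 - STATS.min1) / (STATS.max1 - STATS.min1)) +
        ((fld2 - STATS.min2) / (STATS.max2 - STATS.min2)) +
        ((fld3 - STATS.min3) / (STATS.max3 - STATS.min3)) AS norm_sum;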

Pig: how to count the number of rows in an alias

I did something like this to count the number of rows in an alias in PIG:
logs = LOAD 'log';
logs_w_one = foreach logs generate 1 as one;
logs_group = group logs_w_one all;
logs_count = foreach logs_group generate SUM(logs_w_one.one);
dump logs_count;
This seems to be too inefficient. Please enlighten me if there is a better way!
COUNT is part of Pig; see the manual:
LOGS= LOAD 'log';
LOGS_GROUP= GROUP LOGS ALL;
LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);
Arnon Rotem-Gal-Oz already answered this question a while ago, but I thought some may like this slightly more concise version.
LOGS = LOAD 'log';
LOG_COUNT = FOREACH (GROUP LOGS ALL) GENERATE COUNT(LOGS);
Be careful: with COUNT, tuples whose first field is null are not counted. Use COUNT_STAR instead to count all rows.
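A tiny illustration of the difference, assuming a hypothetical LOGS relation of three tuples where one has a null first field:
GRP = GROUP LOGS ALL;
CNT = FOREACH GRP GENERATE COUNT(LOGS) AS cnt,          -- 2: the null row is skipped
                           COUNT_STAR(LOGS) AS cnt_all; -- 3: every row is counted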
Basic counting is done as was stated in other answers, and in the Pig documentation:
logs = LOAD 'log';
all_logs_in_a_bag = GROUP logs ALL;
log_count = FOREACH all_logs_in_a_bag GENERATE COUNT(logs);
dump log_count;
You are right that counting is inefficient, even when using Pig's builtin COUNT, because it uses a single reducer. However, I had a revelation today: one way to speed it up is to reduce the RAM utilization of the relation we're counting.
In other words, when counting a relation we don't actually care about the data itself, so let's use as little RAM as possible. You were on the right track with the first iteration of your count script.
logs = LOAD 'log';
ones = FOREACH logs GENERATE 1 AS one:int;
counter_group = GROUP ones ALL;
log_count = FOREACH counter_group GENERATE COUNT(ones);
dump log_count;
This will work on much larger relations than the previous script and should be much faster. The main difference between this and your original script is that we don't need to sum anything.
This also doesn't have the problem other solutions have, where null values would impact the count: it counts all the rows, regardless of whether the first column is null.
Use COUNT_STAR:
LOGS= LOAD 'log';
LOGS_GROUP= GROUP LOGS ALL;
LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT_STAR(LOGS);
Here is a version with an optimization.
All the solutions above require Pig to read and write full tuples while counting; the script below writes only 1's:
DEFINE row_count(inBag, name) RETURNS result {
X = FOREACH $inBag generate 1;
$result = FOREACH (GROUP X ALL PARALLEL 1) GENERATE '$name', COUNT(X);
};
Then use it like:
xxx = row_count(rows, 'rows_count');
What you want is to count all the lines in a relation (a dataset in Pig Latin).
This is very easy, following the next steps:
logs = LOAD 'log'; -- relation called logs, using PigStorage with tab as the field delimiter
logs_grouped = GROUP logs ALL; -- gives a relation with one row, with logs as a bag
number = FOREACH logs_grouped GENERATE COUNT_STAR(logs); -- show me the number
I have to say Kevin's point is important: using COUNT instead of COUNT_STAR, we would only get the number of lines whose first field is not null.
I also like Jerome's one-line syntax, which is more concise, but to be didactic I prefer to divide it in two and add some comments.
In general I prefer:
numerito = FOREACH (GROUP CARGADOS3 ALL) GENERATE COUNT_STAR(CARGADOS3);
over
name = GROUP CARGADOS3 ALL;
number = FOREACH name GENERATE COUNT_STAR(CARGADOS3);
