Outputting a tuple with space between two values in pig - hadoop

I have been using pig to filter a large file which contains data in tab separated form. The data inside that file is in the following form - fname lname age
Bill Gates 50
Warren Buffet 100
Elon Musk 80
Jack Dorsey 10
I want to filter this file to keep rows where age > 50 and store the resulting data in (fname lname) form in a file using Pig.
Here is the code which I'm using -
data = LOAD 'persons.txt' AS (fname:chararray, lname:chararray, age:int);
data1 = FILTER data BY age > 50;
data2 = FOREACH data1 GENERATE (fname, lname);
STORE data2 INTO 'result.txt';
By using this code, I am getting the following output -
(Warren,Buffet)
(Elon,Musk)
This is not the output I want; instead I want the following output -
(Warren Buffet)
(Elon Musk)
In order to get this kind of output I have tried using FOREACH data1 GENERATE (fname lname), without a comma between fname and lname. But it fails with Syntax error, unexpected symbol at or near fname.
Can anybody help me get the correct output?
Note -> I am running Pig on a Hadoop cluster, not locally.

Use CONCAT with a space in between fname and lname
data2 = FOREACH data1 GENERATE CONCAT(fname,' ',lname);
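One caveat worth checking against your Pig version (this is an assumption, not from the answer above): some older Pig releases accept only two arguments to CONCAT. If the three-argument call fails, nest the calls instead:

```
data2 = FOREACH data1 GENERATE CONCAT(CONCAT(fname, ' '), lname);
```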

Related

Iterate on 2 Data Sources in PIG

I have 2 data sources
1) Params.txt which has the following content
item1
item2
item2
.
.
.
itemN
2) Data.txt which has the following content
he names (aliases) of relations A, B, and C are case sensitive.
The names (aliases) of fields f1, f2, and f3 are case sensitive.
Function names PigStorage and COUNT are case sensitive.
Keywords LOAD, USING, AS, GROUP, BY, FOREACH, GENERAT
and DUMP are case insensitive. They can also be written
The task is to see whether each of the N items in the param file exists in each line of the data file.
This is the pseudocode for the same:
FOREACH d IN data:
    FOREACH PARAM IN PARAMS:
        IF PARAM IN d:
            GENERATE PARAM, 1
Is something of this sort possible in Pig scripting? If yes, could you please point me in that direction.
Thanks
This is possible in Pig, but Pig is perhaps an unusual language to solve the problem!
I would approach the problem like this:
Load in Params.txt
Load in Data.txt and tokenise each line (assuming you're happy to split the text on spaces - you might need to think about what to do with punctuation)
Flatten the bag from tokenise to get one "word" per record in the relation.
Join the Params and Data relations. An inner join gives you only the words present in both.
Group the data and then count the occurrence of each word.
params = LOAD 'Params.txt' USING PigStorage() AS (param_word:chararray);
data = LOAD 'Data.txt' USING PigStorage() AS (line:chararray);
token_data = FOREACH data GENERATE TOKENIZE(line) AS words:{(word:chararray)};
token_flat = FOREACH token_data GENERATE FLATTEN(words) AS (word);
joined = JOIN params BY param_word, token_flat BY word;
word_count = FOREACH (GROUP joined BY params::param_word) GENERATE
group AS param_word,
COUNT(joined) AS param_word_count;

how can I merge sparse tables in hadoop?

I have a number of csv files containing a single column of values:
File1:
ID|V1
1111|101
4444|101
File2:
ID|V2
2222|102
4444|102
File3:
ID|V3
3333|103
4444|103
I want to combine these to get:
ID|V1|V2|V3
1111|101||
2222||102|
3333|||103
4444|101|102|103
There are many (100 million) rows, and about 100 columns/tables.
I've been trying to use Pig, but I'm a beginner, and am struggling.
For two files, I can do:
s1 = load 'file1.psv' using PigStorage('|') as (ID,V1);
s2 = load 'file2.psv' using PigStorage('|') as (ID,V2);
cg = cogroup s1 by ID, s2 by ID;
merged = foreach cg generate group, flatten((IsEmpty(s1) ? null : s1.V1)), flatten((IsEmpty(s2) ? null : s2.V2));
But I would like to do this with whatever files are present, up to 100 or so, and I don't think I can cogroup that many big files without running out of memory. So I'd rather get the column name from the header than just hard-coding it. In other words, this 2-file toy example doesn't scale.

Use column values to fetch data from other dataset (sort of data transpose ) using Apache pig

I've two data-sets , one is source data and another is Metadata .
source data
============
name city state country
Ram Agra UP India
John Aligarh UP India
Shyam Merrut UP India
Isha Kanpur UP India
Metadata
=========
column_input flag
name Y
city Y
state N
country N
FINAL OUTPUT
============
name city
Ram Agra
John Aligarh
Shyam Merrut
Isha Kanpur
We need only a few columns from the source, based on the meta information, so we have to read the metadata data-set first. The logic: the flag should be 'Y' (here, for 'city' and 'state'), so we need to pull only those two columns from the source data.
I'm able to get the column names from the metadata data-set; now how can I pass these column names to the source load to fetch the corresponding columns' data?
current code
meta_data_read = LOAD '/user/aidb' USING PigStorage(',') AS (column_input,flag);
filter_flag = FILTER meta_data_read BY LOWER(TRIM(flag)) == 'y';
gen_required_col = FOREACH filter_flag GENERATE column_input;
dump gen_required_col ;
(city)
(state)
If all data rows have to be processed against the same metadata, I would create a small (shell) script that processes the meta file and returns the field names comma-separated, then store that in a Pig parameter and use it to project the required fields.
Here's an example (NOTE: I did not create the shell script, just declared the PROJECT parameter, but the script would be easy):
set pig.pretty.print.schema true;
%default PROJECT 'a,c'
data = LOAD 'SO/simple.txt' USING PigStorage(',') AS (a:chararray, b:chararray, c:chararray, d:chararray);
DESCRIBE data;
dump data;
data_p = FOREACH data GENERATE
$PROJECT;
DESCRIBE data_p;
DUMP data_p;
So the PROJECT parameter contains the fields that need to be projected; just use it in a FOREACH statement.
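The helper script left as an exercise above could look like this (a sketch; it assumes the metadata file is comma-separated with lines like `name,Y` / `state,N`, matching the asker's load statement):

```shell
# Create a small metadata file matching the asker's format (column_input,flag).
printf 'name,Y\ncity,Y\nstate,N\n' > metadata.csv

# Print a comma-separated list of the column names whose flag is 'y'/'Y',
# ignoring surrounding whitespace in the flag field.
awk -F',' 'tolower($2) ~ /^[[:space:]]*y[[:space:]]*$/ { cols = cols ? cols "," $1 : $1 }
           END { print cols }' metadata.csv
```

You would then pass the result in on the command line, e.g. `pig -param PROJECT="$(awk ... metadata.csv)" script.pig`, instead of hard-coding `%default PROJECT 'a,c'`.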
The describe results of this:
data: {
a: chararray,
b: chararray,
c: chararray,
d: chararray
}
data_p: {
a: chararray,
c: chararray
}
I hope this solves your problem.

sampling of records inside group by throwing error

sample data : (tsv file: sampl)
1 a
2 b
3 c
raw= load 'sampl' using PigStorage() as (f1:chararray,f2:chararray);
grouped = group raw by f1;
describe grouped;
fields = foreach grouped {
x = sample raw 1;
generate x;
}
When I run this, I get an error at the line x = sample raw 1;
ERROR 1200: mismatched input 'raw' expecting LEFT_PAREN
Is sampling not allowed for a grouped record?
You can't use the SAMPLE operator inside a nested block; this is not supported in Pig.
Only a few operations (CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY) are allowed in a nested block, so you have to use SAMPLE outside of the nested block.
The other problem is that you are loading your input data with the default delimiter, i.e. tab, but your input data is delimited with spaces, so you need to change your script like this:
raw= load 'sampl' using PigStorage(' ') as (f1:chararray,f2:chararray);
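If the underlying goal was to pick one record per group, a common workaround (a sketch, not part of the original answer) is a nested LIMIT, which is one of the operations allowed inside the block:

```
grouped = group raw by f1;
fields = foreach grouped {
    x = limit raw 1;  -- LIMIT is allowed in a nested block, unlike SAMPLE
    generate group, flatten(x);
}
```

Note this takes an arbitrary record per group rather than a uniformly random one; for true random sampling, apply SAMPLE to the relation before grouping.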

Pig how to assign name to columns?

I have a csv file which has hundreds of columns; when I load the file into Pig, I don't want to assign each column like
A = load 'path/to/file' as (a,b,c,d,e......)
Since I'll filter out a lot of them in the second step:
B = foreach A generate $0,$2,....;
But here, can I assign a name and type to each column of B? Something like
B = foreach A generate $0,$2,... AS (a:int,b:int,c:float)
I tried the above code but it doesn't work.
Thanks.
You have to specify an alias after each column, separated by commas:
B = foreach A generate $0 as a, $2 as b, ...
Note that each column just keeps whatever type it already has; AS only renames here.
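If you also want to change the types (as the a:int, b:int, c:float attempt suggests), an explicit cast per column works; this is a sketch with hypothetical column positions:

```
B = foreach A generate (int)$0 as a, (int)$2 as b, (float)$4 as c;
```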