Pig flatten error - hadoop

I tried this script on my nested data:
books = load 'data/book-seded-workings-reduced.json'
using JsonLoader('user_id:chararray,type:chararray,title:chararray,year:chararray,publisher:chararray,authors:{(name:chararray)},source:chararray');
group_auth = group books by title;
maped = foreach group_auth generate group, books.authors;
fil = foreach maped generate flatten(books);
DUMP fil;
but I got this error: "A column needs to be projected from a relation for it to be used as a scalar". Any idea?

books = load 'input.data'
using JsonLoader('user_id:chararray,
type:chararray,
title:chararray,
year:chararray,
publisher:chararray,
authors:{(name:chararray)},source:chararray');
flatten_authors = foreach books generate title, FLATTEN(authors.name);
dump flatten_authors;
Output (input data referred from "Loading JSON file with serde in Cloudera"):
(Modern Database Systems: The Object Model, Interoperability, and Beyond.,null)
(Inequalities: Theory of Majorization and Its Application.,Albert W. Marshall)
(Inequalities: Theory of Majorization and Its Application.,Ingram Olkin)
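For context on why the original script failed: after the group-and-project, maped's fields are only group and the projected authors bag, so flatten(books) refers to the separate books relation, which Pig treats as a scalar projection, hence the error. The fix flattens authors.name directly, producing one output row per bag element. A minimal Python sketch of that FLATTEN semantics (sample records mirror the output above):

```python
# Simulate Pig's FLATTEN(authors.name): one output row per element of the bag.
records = [
    {"title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.",
     "authors": [None]},
    {"title": "Inequalities: Theory of Majorization and Its Application.",
     "authors": ["Albert W. Marshall", "Ingram Olkin"]},
]

# Pair the title with each bag element, like FLATTEN does.
flatten_authors = [(r["title"], name) for r in records for name in r["authors"]]
for row in flatten_authors:
    print(row)
```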

Related

Pig - Store a complex relation schema in a hive table

Here is my deal today. I have created a relation as the result of a couple of transformations after reading the relation from Hive. The thing is that I want to store the final relation back in Hive after some analysis, but I can't. Let me show it in my code to make it clearer.
First, I LOAD from Hive and transform my result:
july = LOAD 'POC.july' USING org.apache.hive.hcatalog.pig.HCatLoader();
july_cl = FOREACH july GENERATE GetDay(ToDate(start_date)) as day:int, start_station, duration;
jul_cl_fl = FILTER july_cl BY day==31;
july_gr = GROUP jul_cl_fl BY (day,start_station);
july_result = FOREACH july_gr {
total_dura = SUM(jul_cl_fl.duration);
avg_dura = AVG(jul_cl_fl.duration);
qty_trips = COUNT(jul_cl_fl);
GENERATE FLATTEN(group),total_dura,avg_dura,qty_trips;
};
So now, when I try to store the relation july_result, I can't, because the schema has changed and I suppose it's no longer compatible with Hive:
STORE july_result INTO 'poc.july_analysis' USING org.apache.hive.hcatalog.pig.HCatStorer();
I have also tried to set an explicit schema for the final relation, but I haven't figured it out:
july_result = FOREACH july_gr {
total_dura = SUM(jul_cl_fl.duration);
avg_dura = AVG(jul_cl_fl.duration);
qty_trips = COUNT(jul_cl_fl);
GENERATE FLATTEN(group) as (day:int),total_dura as (total_dura:int),avg_dura as (avg_dura:int),qty_trips as (qty_trips:int);
};
After some research in the Hortonworks community, I got the solution for how to define an output schema for a grouped relation in Pig. My new code looks like:
july_result = FOREACH july_gr {
total_dura = SUM(jul_cl_fl.duration);
avg_dura = AVG(jul_cl_fl.duration);
qty_trips = COUNT(jul_cl_fl);
GENERATE FLATTEN(group) AS (day, code_station), (int)total_dura as (total_dura:int), (float)avg_dura as (avg_dura:float), (int)qty_trips as (qty_trips:int);
};
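The key points in the fix are that FLATTEN(group) AS (day, code_station) gives the grouped key fields names that HCatStorer can match against the Hive table, and the explicit (int)/(float) casts pin down the aggregate types. A rough Python sketch of the same group-and-aggregate shape (the trip rows are made up for illustration):

```python
from collections import defaultdict

# Hypothetical (day, start_station, duration) rows after the filter on day==31.
trips = [(31, "A", 10), (31, "A", 20), (31, "B", 30)]

# GROUP jul_cl_fl BY (day, start_station): composite key -> bag of durations.
groups = defaultdict(list)
for day, station, duration in trips:
    groups[(day, station)].append(duration)

# One row per group key, mirroring FLATTEN(group) plus SUM, AVG, COUNT
# with the explicit int/float casts from the Pig script.
result = [(day, station, int(sum(d)), float(sum(d) / len(d)), len(d))
          for (day, station), d in sorted(groups.items())]
print(result)
```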
Thanks guys.

Unable to dump a relation in PIG

I've been stuck on a problem for a very long time. Any help would be appreciated.
I have a dataset file in the /home/hadoop/pig directory. I can view that file, so there is no permissions issue.
The dataset has 4 columns separated by "::" as the delimiter.
I'm running Pig in local mode from inside the /home/hadoop/pig directory.
ratingsData = LOAD 'ratings.dat' AS (line:chararray);
ratings = FOREACH ratingsData GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);
grouped_mid = GROUP ratings BY mid;
dump grouped_mid;
The above script fails. I can successfully dump the 'ratingsData' and 'ratings' relations, but not grouped_mid. But here's the bizarre part: the script below runs successfully.
ratingsData = LOAD 'ratings.dat' AS (line:chararray);
ratings = FOREACH ratingsData GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);
STORE ratings INTO 'ratingInfo.txt';
X = LOAD 'ratingInfo.txt' AS (uid:int, mid:int, rating:int, timestamp:long);
grouped_mid = GROUP X BY mid;
dump grouped_mid;
Obviously, the second script has a redundant step: I'm simply storing a relation and reloading it again. I want to avoid this.
Any clarification/explanation would be highly appreciated.
Thanks much.
Just refer to this: pig join with java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
You can modify your script to:
ratingsData = LOAD 'ratings.dat' AS (line:chararray);
ratings = FOREACH ratingsData GENERATE FLATTEN((tuple(int, int, int, long))REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);
grouped_mid = GROUP ratings BY mid;
dump grouped_mid;
Tested.
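The root cause is that REGEX_EXTRACT_ALL always returns chararrays; the AS clause only declares a schema, so the values are still strings when GROUP tries to treat mid as an int, and the job fails at that point. The (tuple(int, int, int, long)) cast converts the values for real, which is also what the store-and-reload detour achieved. A small Python sketch of the same idea (the sample line is hypothetical):

```python
import re

# REGEX_EXTRACT_ALL must match the whole line, so fullmatch mirrors it here.
line = "1::1193::5::978300760"  # hypothetical ratings.dat row
raw = re.fullmatch(r"(.*?)::(.*?)::(.*?)::(.*?)", line).groups()

# The captured groups come back as strings, regardless of any declared schema...
assert all(isinstance(v, str) for v in raw)

# ...so cast explicitly, as (tuple(int, int, int, long)) does in the Pig fix.
uid, mid, rating, timestamp = (int(v) for v in raw)
print((uid, mid, rating, timestamp))
```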

How to pass the value from one load statement into another load statement in pig script

Hi, I have two load statements, A and B. I want to pass particular column values from A to B. I tried the following code:
A = LOAD '/user/bangalore/part-m-00000-bangalore' using PigStorage ('\t') as (generatekey:chararray,PropertyID:chararray,ssk:chararray,ptsk:chararray,ptid:chararray,BuiltUpArea:chararray,Price:chararray,pn:chararray,NoOfBedRooms:chararray,NoOfBathRooms:chararray,balconies:chararray,Furnished:chararray,TowerNo:chararray,NoOfTowers:chararray,UnitsOntheFloor:chararray,FloorNoOfProperty:chararray,TotalFloors:chararray,NumberOfLifts:chararray,Facing:chararray,Description:chararray,NewResale:chararray,Possession:chararray,Age:chararray,Ownership:chararray,Type:chararray,PropertyAddress:chararray,Property_Address2:chararray,city:chararray,state:chararray,Property_PinCode:chararray,Locality:chararray,Landmark:chararray,PropertyFeatures:chararray,NearByFacilities:chararray,ReferenceURL:chararray,Flooring:chararray,OverLooking:chararray,ListedOn:chararray,Sellerinfo:chararray,CompanyAddress:chararray,Agency_Address2:chararray,city2:chararray,state2:chararray,Agency_Pincode:chararray,Agency_Phone1:chararray,Agency_Phone2:chararray,ContactName:chararray,Agency_Email:chararray,Agency_WebSite:chararray);
B = foreach A generate Locality;
C = LOAD '/user/april_data/bangalore' using PigStorage ('\t') as (SourceWebSite:chararray,PropertyID:chararray,ListedOn:chararray,ContactName:chararray,TotalViews:int,Price:chararray,PriceperArea:chararray,NoOfBedRooms:int,NoOfBathRooms:int,FloorNoOfProperty:chararray,TotalFloors:int,Possession:chararray,BuiltUpArea:chararray,Furnished:chararray,Ownership:chararray,NewResale:chararray,Facing:chararray,title:chararray,PropertyAddress:chararray,NearByFacilities:chararray,PropertyFeatures:chararray,Sellerinfo:chararray,Description:chararray,emp:chararray);
D = FOREACH C generate title;
E = join B by Locality,D by title;
The Locality column is empty. I want to pass the values from the title column to the Locality column. The above code prints null only. Any help will be appreciated.

How do I get the matching values inside a for loop using FILTER in PIG?

Consider this as my input,
Input (File1):
12345;11
34567;12
.
.
Input (File2):
11;(1,2,3,4,5,6,7,8,9)
12;(9,8,7,6,5,4,3,2,1)
.
.
I would like to get the output as follows:
Output:
(1,2,3,4,5,6,7,8,9)
(9,8,7,6,5,4,3,2,1)
Here's the sample code I have tried using FILTER, and I am facing some errors with it. Please suggest some other options.
data1 = load '/File1' using PigStorage(';') as (id,number);
data2 = load '/File2' using PigStorage(';') as (numberInfo, collection);
out = foreach data1{
Data_filter = FILTER data2 by (numberInfo matches CONCAT(number,''));
generate Data_filter;
}
Is it possible to do this inside a for loop? Please let me know. Thanks in advance!
There are no for loops in Apache Pig; if you need to iterate through each row of the data for some specific purpose, you need to implement your own UDF. The foreach keyword is not used to create a loop; it is used to transform your data based on your columns, applying UDFs to it. You can also use a nested foreach, where you perform operations over each group in your relation.
However, your syntax is wrong. You are trying to use a nested foreach without grouping your data first. What a nested foreach does, is perform the operations you define in the block of code over a grouped relation. Therefore, the only way your code could work is by grouping the data first:
data1 = load '/File1' using PigStorage(';') as (id,number);
data2 = load '/File2' using PigStorage(';') as (numberInfo, collection);
data1 = group data1 by id;
out = foreach data1{
Data_filter = FILTER data2 by (numberInfo matches CONCAT(number,''));
generate Data_filter;
}
However, this won't work because inside a nested foreach you cannot refer to a different relation like data2.
What you really want, is a JOIN operation over both relations using number for data1 and numberInfo for data2. This will give you this:
joined_data = join data1 by number, data2 by numberInfo;
dump joined_data;
(12345,11,11,(1,2,3,4,5,6,7,8,9))
(34567,12,12,(9,8,7,6,5,4,3,2,1))
In your question you said you only wanted as output the last column, so now you can use a foreach to generate the column you want:
final_data = foreach joined_data generate data2::collection;
dump final_data;
((1,2,3,4,5,6,7,8,9))
((9,8,7,6,5,4,3,2,1))
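The join-then-project pipeline above can be sketched in Python as a simple hash join on the number / numberInfo key (sample rows taken from the question):

```python
# Simulate the Pig JOIN: data1 by number, data2 by numberInfo.
data1 = [("12345", "11"), ("34567", "12")]
data2 = [("11", "(1,2,3,4,5,6,7,8,9)"), ("12", "(9,8,7,6,5,4,3,2,1)")]

# Build a hash map on the join key of data2.
lookup = dict(data2)

# Each joined row carries both keys, like Pig's (id, number, numberInfo, collection).
joined = [(id_, num, num, lookup[num]) for id_, num in data1 if num in lookup]

# Project only data2::collection, as the final foreach does.
final = [collection for *_, collection in joined]
print(final)
```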

Error when using MAX in Apache Pig (Hadoop)

I am trying to calculate maximum values for different groups in a relation in Pig. The relation has three columns: patientid, featureid and featurevalue (all int).
I group the relation by featureid and want to calculate the max feature value of each group; here's the code:
grpd = GROUP features BY featureid;
DUMP grpd;
temp = FOREACH grpd GENERATE $0 as featureid, MAX($1.featurevalue) as val;
It's giving me an "Invalid scalar projection: grpd" exception. I read on different forums that MAX takes a bag for such functions, and when I take the dump of grpd, it does show me a bag. Here's a small part of the output from the dump:
(5662,{(22579,5662,1)})
(5663,{(28331,5663,1),(2624,5663,1)})
(5664,{(27591,5664,1)})
(5665,{(30217,5665,1),(31526,5665,1)})
(5666,{(27783,5666,1),(30983,5666,1),(32424,5666,1),(28064,5666,1),(28932,5666,1)})
(5667,{(31257,5667,1),(27281,5667,1)})
(5669,{(31041,5669,1)})
What's the issue?
The issue was with column addressing; here's the correct working code:
grpd = GROUP features BY featureid;
temp = FOREACH grpd GENERATE group as featureid, MAX(features.featurevalue) as val;
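In other words, after a GROUP the key is addressed as group and the grouped tuples sit in a bag named after the original relation (features), so MAX must be applied to features.featurevalue rather than to a projection through an alias. A quick Python sketch of that grouping and MAX (rows taken from the dump above):

```python
from collections import defaultdict

# (patientid, featureid, featurevalue) rows, as in the dump in the question.
features = [(28331, 5663, 1), (2624, 5663, 1), (27591, 5664, 1)]

# GROUP features BY featureid: key -> bag of featurevalue fields.
bags = defaultdict(list)
for patientid, featureid, featurevalue in features:
    bags[featureid].append(featurevalue)

# FOREACH grpd GENERATE group as featureid, MAX(features.featurevalue) as val
temp = [(featureid, max(vals)) for featureid, vals in sorted(bags.items())]
print(temp)
```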
