How to validate a list using PIG script - hadoop

I have List 1 with following schema
{customerId: int,storeId: int,products: {(prodId: int,name: chararray)}}
Customer List with following schema
{uniqueId: int,customerId: int,name: chararray}
Store List with following schema
{uniqueId: int,storeNum: int,name: chararray}
and Product List with schema
{uniqueId: int,sku: int,productName: chararray}
Now I want to check the customerId, storeId, and prodId of each item in List 1 against the other lists to verify whether the ids are valid. The valid items have to be stored in one file and the invalid items in another.
As Pig is very new to me, this feels very complex. Please suggest a good approach for doing this job with Apache Pig.

First of all, load all your data; think of these as tables:
cust_data = LOAD '/your/path/to/customer/data' USING PigStorage() AS (uniqueId: int, customerId: int, name: chararray);
store_data = LOAD '/your/path/to/store/data' USING PigStorage() AS (uniqueId: int, storeNum: int, name: chararray);
product_data = LOAD '/your/path/to/product/data' USING PigStorage() AS (uniqueId: int, sku: int, productName: chararray);
You can check the schema of the loaded data with:
DESCRIBE cust_data;
DESCRIBE store_data;
DESCRIBE product_data;
Join the customer and store data first using uniqueId (we are doing an equijoin):
cust_store_join = JOIN cust_data BY uniqueId, store_data BY uniqueId;
Then generate your columns:
cust_store = FOREACH cust_store_join GENERATE cust_data::uniqueId as uniqueId, cust_data::customerId as customerId, cust_data::name as cust_name, store_data::storeNum as storeNum, store_data::name as store_name;
Now join the customer-store result with the product data using uniqueId (again an equijoin):
cust_store_product_join = JOIN cust_store BY uniqueId, product_data BY uniqueId;
Finally, generate all your desired columns:
customer_store_product = FOREACH cust_store_product_join GENERATE cust_store::uniqueId as uniqueId, cust_store::customerId as customerId, cust_store::cust_name as cust_name, cust_store::storeNum as storeNum, product_data::sku as sku, product_data::productName as productName;
Now store your desired columns in your local or HDFS directory. The STORE command below will write all rows whose uniqueId matched across all three tables (customer, store, product):
STORE customer_store_product INTO '/your/output/path' USING PigStorage(',');
Similarly, you can join your List 1 relation, generate columns, and store the data using the same logic.
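To produce the valid/invalid split the question actually asks for, one option is LEFT OUTER JOINs plus SPLIT. The sketch below is only illustrative: the relation names (list1, items, flat_*), the output paths, and the assumed key pairings (customerId against customerId, storeId against storeNum, prodId against sku) are assumptions based on the schemas above, not tested code.

```pig
-- Load List 1 and flatten its products bag so each row carries one (customerId, storeId, prodId)
list1 = LOAD '/your/path/to/list1' USING PigStorage()
        AS (customerId: int, storeId: int, products: {(prodId: int, name: chararray)});
items = FOREACH list1 GENERATE customerId, storeId, FLATTEN(products) AS (prodId, prodName);

-- LEFT OUTER JOIN against each reference list; a null right side means the id was not found
chk_cust = JOIN items BY customerId LEFT OUTER, cust_data BY customerId;
flat_cust = FOREACH chk_cust GENERATE items::customerId AS customerId,
                                      items::storeId AS storeId,
                                      items::prodId AS prodId,
                                      (cust_data::customerId IS NOT NULL ? 1 : 0) AS cust_ok;

chk_store = JOIN flat_cust BY storeId LEFT OUTER, store_data BY storeNum;
flat_store = FOREACH chk_store GENERATE flat_cust::customerId AS customerId,
                                        flat_cust::storeId AS storeId,
                                        flat_cust::prodId AS prodId,
                                        flat_cust::cust_ok AS cust_ok,
                                        (store_data::storeNum IS NOT NULL ? 1 : 0) AS store_ok;

chk_prod = JOIN flat_store BY prodId LEFT OUTER, product_data BY sku;
flat_prod = FOREACH chk_prod GENERATE flat_store::customerId AS customerId,
                                      flat_store::storeId AS storeId,
                                      flat_store::prodId AS prodId,
                                      flat_store::cust_ok AS cust_ok,
                                      flat_store::store_ok AS store_ok,
                                      (product_data::sku IS NOT NULL ? 1 : 0) AS prod_ok;

-- An item is valid only when all three lookups matched
SPLIT flat_prod INTO valid IF (cust_ok == 1 AND store_ok == 1 AND prod_ok == 1),
                     invalid OTHERWISE;

STORE valid INTO '/your/output/valid' USING PigStorage(',');
STORE invalid INTO '/your/output/invalid' USING PigStorage(',');
```

The flattening FOREACH after each join keeps the field names flat, so the next join does not need deeply nested disambiguation; run DESCRIBE after each step to confirm the names on your Pig version. SPLIT ... OTHERWISE requires Pig 0.10 or later.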
Hope this helps

PIG: How to remove '::' in the column name

I have a pig relation like below:
FINAL = {input_md5::type: chararray, input_md5::name: chararray, input_md5::id: long, input_md5::age: chararray, test_1::type: chararray, test_2::name: chararray}
I am trying to store all the columns of the input_md5 relation to a Hive table:
i.e. all of input_md5::type, input_md5::name, input_md5::id, and input_md5::age, but not test_1::type or test_2::name.
Is there any command in Pig which selects only the columns of input_md5? Something like below:
STORE= FOREACH FINAL GENERATE all input_md5::type .
I know that Pig has the FOREACH FINAL GENERATE input_md5::type AS type syntax, but I have many columns, so I cannot write AS for each one in my code.
Because when I try:
STORE= FOREACH FINAL GENERATE input_md5::type .. bus_input_md5::name;
Pig throws an error:
org.apache.hive.hcatalog.common.HCatException : 2007 : Invalid column position in partition schema : Expected column <type> at position 1, found column <input_md5::type>
Thanks in advance,
Resolved this issue; below is the fix.
Create a relation with some filter condition, as below:
DUMMY_RELATION = FILTER SOURCE_TABLE BY type == '';
(I took a column named type; you can filter by any column in the table. All that matters is that we need its schema.)
FINAL_DATASET = UNION DUMMY_RELATION, SCHEMA_1, SCHEMA_2;
(This new DUMMY_RELATION should be placed first in the union.)
Now you no longer have the :: operator, and your column names will match the Hive table's column names, provided your source table (the input to DUMMY_RELATION) and target table have the same column order.
Thanks to myself :)
I implemented Neethu's example this way. May have typos, but it shows how to implement this idea.
tableA = LOAD 'default.tableA' USING org.apache.hive.hcatalog.pig.HCatLoader();
tableB = LOAD 'default.tableB' USING org.apache.hive.hcatalog.pig.HCatLoader();
--load empty table
finalTable = LOAD 'default.finalTable' USING org.apache.hive.hcatalog.pig.HCatLoader();
--example operations that end up with '::' in column names
g = group tableB by (id);
j = JOIN tableA by id LEFT, g by group;
result = foreach j generate tableA::id, tableA::col2, g::tableB;
--union empty finalTable and result
result2 = union finalTable, result;
--bob's your uncle
STORE result2 INTO 'finalTable' USING org.apache.hive.hcatalog.pig.HCatStorer();
Thanks to Neethu!

Apache PIG - GROUP BY

I am looking to achieve the below functionality in Pig. I have a set of sample records like this.
Note that the EffectiveDate column is sometimes blank and also different for the same CustomerID.
Now, as output, I want one record per CustomerID, the one with the maximum EffectiveDate.
The way I am doing it currently using PIG is this:
customerdata = LOAD 'customerdata' AS (CustomerID:chararray, CustomerName:chararray, Age:int, Gender:chararray, EffectiveDate:chararray);
--Group customer data by CustomerID
customerdata_grpd = GROUP customerdata BY CustomerID;
--From the grouped data, generate one record per CustomerID that has the maximum EffectiveDate.
customerdata_maxdate = FOREACH customerdata_grpd GENERATE group as CustID, MAX(customerdata.EffectiveDate) as MaxDate;
--Join the above with the original data so that we get the other details like CustomerName, Age etc.
joinwithoriginal = JOIN customerdata by (CustomerID, EffectiveDate), customerdata_maxdate by (CustID, MaxDate);
finaloutput = FOREACH joinwithoriginal GENERATE customerdata::CustomerID as CustomerID, CustomerName as CustomerName, Age as Age, Gender as gender, EffectiveDate as EffectiveDate;
I am basically grouping the original data to find the record with the maximum EffectiveDate. Then I join these 'grouped' records with the Original dataset again to get that same record with Max Effective date, but this time I will also get additional data like CustomerName, Age and Gender. This dataset is huge, so this approach is taking a lot of time. Is there a better approach?
Input :
1,John,28,M,1-Jan-15
1,John,28,M,1-Feb-15
1,John,28,M,
1,John,28,M,1-Mar-14
2,Jane,25,F,5-Mar-14
2,Jane,25,F,5-Jun-15
2,Jane,25,F,3-Feb-14
Pig Script :
customer_data = LOAD 'customer_data.csv' USING PigStorage(',') AS (id:int,name:chararray,age:int,gender:chararray,effective_date:chararray);
customer_data_fmt = FOREACH customer_data GENERATE id..gender,ToDate(effective_date,'dd-MMM-yy') AS date, effective_date;
customer_data_grp_id = GROUP customer_data_fmt BY id;
req_data = FOREACH customer_data_grp_id {
    customer_data_ordered = ORDER customer_data_fmt BY date DESC;
    req_customer_data = LIMIT customer_data_ordered 1;
    GENERATE FLATTEN(req_customer_data.id) AS id,
             FLATTEN(req_customer_data.name) AS name,
             FLATTEN(req_customer_data.gender) AS gender,
             FLATTEN(req_customer_data.effective_date) AS effective_date;
};
Output :
(1,John,M,1-Feb-15)
(2,Jane,F,5-Jun-15)
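Note that the nested FOREACH above drops the age column, while the question asks for the other details (CustomerName, Age, Gender) as well. A sketch of a variant that keeps every wanted column of the winning row, under the same assumed input (req_data_full is a made-up name):

```pig
req_data_full = FOREACH customer_data_grp_id {
    ordered = ORDER customer_data_fmt BY date DESC;
    top1 = LIMIT ordered 1;
    -- project the columns you want from the one-row bag, then flatten it
    GENERATE FLATTEN(top1.(id, name, age, gender, effective_date));
};
```

This keeps the approach to a single MapReduce job, avoiding the join back to the original data that the question describes as slow.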

Hive Query - Unable to find, for movies having more than 30 ratings, the average rating

I have created a table in Hive using the query:
CREATE TABLE u_data (
userid INT,
movieid INT,
rating INT,
unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
then loaded some data into it. Now I want to retrieve the average rating of movies having more than 30 ratings.
I tried creating a view using query:
create view ratingcount as select movieid, count(rating) as num_of_ratings from u_data group by movieid;
and then used join query:
Select movieid, avg(rating) from u_data join ratingcount on u_data.movieid = ratingcount.movieid where num_of_ratings > 30;
which is giving an exception. Please let me know how to retrieve the required data.
Try this:
Select movieid, avg(rating) from u_data group by movieid having count(rating) > 30;
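For reference, the original view-plus-join approach can also be made to work by qualifying the columns (the unqualified movieid is what Hive rejects as ambiguous) and adding the missing GROUP BY. A sketch, assuming the ratingcount view from the question:

```sql
SELECT u.movieid, AVG(u.rating) AS avg_rating
FROM u_data u
JOIN ratingcount r ON u.movieid = r.movieid
WHERE r.num_of_ratings > 30
GROUP BY u.movieid;
```

The single-query HAVING version above is still simpler, since it avoids materializing the view at all.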

EF6 How to query where children contains all values of a list

Say I have a document table, with doc_id (PK) and doc_name fields, a category table with cat_id (PK) and cat_name fields, and a document_categories table with doc_id (PK, FK) and cat_id (PK, FK) fields, so I can attribute one or many categories to each document.
I have generated a model with EF6 in "Database first" mode, which gives me two entities: document and category, each containing a field which is a collection of children.
document contains a categories field which lists the categories of the document, and vice-versa in the category entity.
Now, I want to query all documents that contain category 1 AND category 2.
Let's say the database contains the following documents:
Doc A: Categories 1, 3
Doc B: Categories 1, 2
Doc C: Categories 1
Doc D: Categories 1, 2, 3
My query should return docs B and D.
How can I achieve that with EF6 using Linq?
I searched long on this site and on Google but found nothing for this particular request. Thanks for your help!
Use this:
var ids = new int[] { 1, 2 };
var docs = context.Documents
    .Where(d => ids.All(cid => d.Categories.Any(dc => dc.cat_id == cid)))
    .ToList();
Or
var ids = new int[] { 1, 2 };
var result = from d in context.Documents
             where ids.All(id => d.Categories.Any(dc => dc.cat_id == id))
             select d;

Aggregated information and projection in Pig Latin

I'm trying to apply a maximum aggregate function to a table by grouping on some fields and projecting. Can I refer to other non-grouping fields in the original table in the aggregating projection?
As an example, I have a table blah with schema (user_id: long, order_id: long, product_id: long, gender: chararray, size: int), where user_id, order_id, and product_id form a composite key, but there can be multiple rows per user id and order id. To get the maximum size for each order I use:
result_table = foreach (group blah by (user_id, order_id)) generate
FLATTEN(group) as (user_id, order_id),
MAX(blah.size) as max_size;
Is there some way I can also add product_id to the creation of result_table so I have a table containing the user_id, order_id, product_id and max_size (max_size would be duplicated over differing product_ids) ?
If I could refer to the product_id specific to each grouped user_id and order_id I can save myself a mapreduce job by not joining back with the original table to access this field. Thanks guys.
Pig is well suited for such things: it has bags, which enable it to do things that in SQL would require extra joins.
If you do the following:
grp = group blah by (user_id, order_id);
describe grp;
you will see that there is a bag whose schema is identical to the schema of blah (something like group: (user_id: long, order_id: long), blah: {(user_id: long, order_id: long, product_id: long, gender: chararray, size: int)}). That is very powerful, as it allows us to create an output containing all of the original rows with group summaries in each row, without using inner joins:
grp = group blah by (user_id, order_id);
result_table = foreach grp generate
FLATTEN(blah.(user_id, order_id, product_id)), -- flatten the bag created by original group
MAX(blah.size) as max_size;
If the same product_id appears multiple times within a group of user_id, order_id, then there will be duplicates; to avoid them we can use a DISTINCT nested inside the FOREACH:
grp = group blah by (user_id, order_id);
result_table = foreach grp {
    dist = distinct blah.(user_id, order_id, product_id); -- remove duplicates
    generate flatten(dist), MAX(blah.size) as max_size;
};
It will be done in a single MapReduce job.
