Apache PIG - GROUP BY - hadoop

I am looking to achieve the below functionality in Pig. I have a set of sample records like this.
Note that the EffectiveDate column is sometimes blank and also different for the same CustomerID.
As output, I want one record per CustomerID: the one whose EffectiveDate is the maximum. So, for the above example, I want the records highlighted as shown below.
The way I am doing it currently using PIG is this:
customerdata = LOAD 'customerdata' AS (CustomerID:chararray, CustomerName:chararray, Age:int, Gender:chararray, EffectiveDate:chararray);
--Group customer data by CustomerID
customerdata_grpd = GROUP customerdata BY CustomerID;
--From the grouped data, generate one record per CustomerID that has the maximum EffectiveDate.
customerdata_maxdate = FOREACH customerdata_grpd GENERATE group as CustID, MAX(customerdata.EffectiveDate) as MaxDate;
--Join the above with the original data so that we get the other details like CustomerName, Age etc.
joinwithoriginal = JOIN customerdata by (CustomerID, EffectiveDate), customerdata_maxdate by (CustID, MaxDate);
finaloutput = FOREACH joinwithoriginal GENERATE customerdata::CustomerID as CustomerID, CustomerName as CustomerName, Age as Age, Gender as gender, EffectiveDate as EffectiveDate;
I am basically grouping the original data to find the record with the maximum EffectiveDate. Then I join these 'grouped' records with the Original dataset again to get that same record with Max Effective date, but this time I will also get additional data like CustomerName, Age and Gender. This dataset is huge, so this approach is taking a lot of time. Is there a better approach?

Input :
1,John,28,M,1-Jan-15
1,John,28,M,1-Feb-15
1,John,28,M,
1,John,28,M,1-Mar-14
2,Jane,25,F,5-Mar-14
2,Jane,25,F,5-Jun-15
2,Jane,25,F,3-Feb-14
Pig Script :
customer_data = LOAD 'customer_data.csv' USING PigStorage(',') AS (id:int,name:chararray,age:int,gender:chararray,effective_date:chararray);
customer_data_fmt = FOREACH customer_data GENERATE id..gender,ToDate(effective_date,'dd-MMM-yy') AS date, effective_date;
customer_data_grp_id = GROUP customer_data_fmt BY id;
req_data = FOREACH customer_data_grp_id {
    -- Sort each customer's records so the most recent date comes first.
    customer_data_ordered = ORDER customer_data_fmt BY date DESC;
    req_customer_data = LIMIT customer_data_ordered 1;
    GENERATE FLATTEN(req_customer_data.(id, name, gender, effective_date))
        AS (id, name, gender, effective_date);
};
Output :
(1,John,M,1-Feb-15)
(2,Jane,F,5-Jun-15)
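The answer converts the date strings with ToDate before ordering, because strings such as 1-Feb-15 and 1-Jan-15 do not sort chronologically as text. The same "latest record per key" logic can be sketched in Python on the sample input. This is only an illustration, not part of the Pig script: the function names are made up for this sketch, and it assumes that blank dates should never win and that customers with only blank dates are dropped.

```python
from datetime import datetime

# Sample rows copied from the input above: (id, name, age, gender, effective_date)
ROWS = [
    (1, "John", 28, "M", "1-Jan-15"),
    (1, "John", 28, "M", "1-Feb-15"),
    (1, "John", 28, "M", ""),
    (1, "John", 28, "M", "1-Mar-14"),
    (2, "Jane", 25, "F", "5-Mar-14"),
    (2, "Jane", 25, "F", "5-Jun-15"),
    (2, "Jane", 25, "F", "3-Feb-14"),
]


def parse_date(text):
    """Return a datetime for 'd-Mon-yy' text, or None when the field is blank."""
    return datetime.strptime(text, "%d-%b-%y") if text else None


def latest_per_customer(rows):
    """Keep, for each id, the row with the greatest non-blank effective date."""
    best = {}  # id -> (parsed_date, row)
    for row in rows:
        parsed = parse_date(row[4])
        if parsed is None:
            continue  # assumption: a blank date never counts as the maximum
        current = best.get(row[0])
        if current is None or parsed > current[0]:
            best[row[0]] = (parsed, row)
    return [row for _, row in best.values()]


for row in latest_per_customer(ROWS):
    print(row)
```

Unlike the Pig output above, this sketch keeps the age column, because it returns the whole winning row.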

Related

LINQ Left Outer Join only the first record

I'm working on a LINQ query that joins three tables. For the Orders and OrderInfo table I expect a single record in each table for a given order id. However for the ShipRate table, there could be 0, 1 or more records for a given order id. So for this table I am using a left outer join. The query shown below is working if 0 or 1 records exist in the ShipRate table, but for instances where the number of records is > 1, I need to select only the most recent ShipRate record. I tried to do this by replacing the line:
from shipRate in sr.DefaultIfEmpty()
with this:
from shipRate in sr.OrderByDescending(r => r.CreateDate).Take(1).DefaultIfEmpty()
but the query takes forever, as if it is loading the entire ShipRate table. Where have I gone wrong?
var query = (from order in db.Orders
             join info in db.OrderInfo
                 on order.OrderId equals info.OrderId
             join shipRate in db.ShipRate
                 on info.OrderId equals shipRate.OrderId
                 into sr
             from shipRate in sr.DefaultIfEmpty()
             where order.OrderId == orderId
             select new
             {
                 OrderId = order.OrderId,
                 OrderDetail = info.OrderDetail,
                 Carrier = shipRate.Carrier
             }).SingleOrDefault();
With a proper model definition your query would be like:
var query = (from order in db.Orders
             where order.OrderId == orderId
             select new
             {
                 OrderId = order.OrderId,
                 OrderDetail = order.OrderInfo.OrderDetail,
                 // Most recent ShipRate first; null when the order has none.
                 Carrier = order.OrderInfo.ShipRates
                     .OrderByDescending(sr => sr.CreateDate)
                     .Select(sr => sr.Carrier)
                     .FirstOrDefault()
             }).SingleOrDefault();
I can't be sure though, because you didn't supply sample data and model.
Cetin Basoz's answer is a good one: ideally you'd set up your model in a way that allows you to use navigation properties. If you're using a model generated from your database schema, that typically means setting up foreign and primary keys properly.
If you can't do that, you should still be able to get a similar effect by writing a query like this:
var query = (from order in db.Orders
             where order.OrderId == orderId
             let orderInfo = db.OrderInfo.FirstOrDefault(info => order.OrderId == info.OrderId)
             let currentShipRate = db.ShipRate
                 .Where(shipRate => order.OrderId == shipRate.OrderId)
                 .OrderByDescending(shipRate => shipRate.CreateDate)
                 .FirstOrDefault()
             select new
             {
                 OrderId = order.OrderId,
                 OrderDetail = orderInfo.OrderDetail,
                 Carrier = currentShipRate.Carrier
             }).SingleOrDefault();
However, LINQ to SQL isn't nearly as good at building advanced queries as Entity Framework, and the symptoms you're describing might be an indication that it's actually doing multiple database round-trips instead of a join. I'd recommend logging the query that you're producing (prior to the .SingleOrDefault()) either by calling .ToString() on the query or by executing your query in LINQPad and clicking on the SQL tab. That might give you a clue as to why the query is misbehaving.
There seems to be a one-to-one relation between Orders and OrderInfos: every Order has exactly one OrderInfo, and every OrderInfo is the info of exactly one Order, namely the Order that the foreign key OrderId refers to.
On the other hand, there seems to be a one-to-many relation between Orders and ShipRates. Every Order has zero or more ShipRates, every ShipRate is a ShipRate of exactly one Order, namely the Order that the foreign key OrderId refers to.
You want several properties of "Orders, each Order with its one and only OrderInfo and its zero or more ShipRates".
Whenever you have a one-to-many relation and you want "items with their zero or more sub-items", like Schools with their Students, Customers with their Orders, or in your case Orders with their ShipRates, consider using one of the overloads of Queryable.GroupJoin.
In the other direction: if you want an item together with the one and only item that its foreign key refers to, like a Student with the School he attends, an Order with the Customer who created it, or an Order with its one and only OrderInfo, use Queryable.Join.
I mostly use the overload of GroupJoin that has a parameter resultSelector, so I can select exactly what properties I want.
int orderId = ...
var ordersWithShipRates = dbContext.Orders.GroupJoin(dbContext.ShipRates,
    order => order.Id,               // from every Order take the primary key
    shipRate => shipRate.OrderId,    // from every ShipRate take the foreign key to Order
    // parameter resultSelector: from every Order, with its zero or more ShipRates
    // make one new object
    (order, shipRatesOfThisOrder) => new
    {
        // Select the Order properties that you plan to use:
        Id = order.Id,
        Date = order.Date,
        ...
        ShipRates = shipRatesOfThisOrder.Select(shipRate => new
        {
            // Select the ShipRate properties that you plan to use:
            Id = shipRate.Id,
            Value = shipRate.Value,
            ...
        })
        .ToList(),
        // A simple join to get the one and only OrderInfo
        OrderInfo = dbContext.OrderInfos.Where(orderInfo => orderInfo.Id == order.Id)
            .Select(orderInfo => new
            {
                // Select the orderInfo properties that you plan to use
                Name = orderInfo.Name,
                ...
            })
            .FirstOrDefault(),
    });
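What a group join computes can be sketched outside of LINQ. The Python below is a rough in-memory illustration, not a translation of the query above: the dictionary keys (order_id, carrier, create_date) and the function name are assumptions made for this sketch. It keeps every order exactly once (the left outer part) and attaches only the most recent ship rate, which is what the original question asked for.

```python
from collections import defaultdict


def orders_with_latest_ship_rate(orders, ship_rates):
    """Left-outer group join: each order appears once, with its newest carrier or None."""
    # Build the "group" side once: order_id -> list of that order's ship rates.
    rates_by_order = defaultdict(list)
    for rate in ship_rates:
        rates_by_order[rate["order_id"]].append(rate)

    result = []
    for order in orders:
        rates = rates_by_order.get(order["order_id"], [])
        # max(..., default=None) plays the role of OrderByDescending + FirstOrDefault.
        newest = max(rates, key=lambda r: r["create_date"], default=None)
        result.append({
            "order_id": order["order_id"],
            "carrier": newest["carrier"] if newest is not None else None,
        })
    return result


orders = [{"order_id": 1}, {"order_id": 2}]
ship_rates = [
    {"order_id": 1, "carrier": "UPS", "create_date": "2015-01-01"},
    {"order_id": 1, "carrier": "FedEx", "create_date": "2015-03-01"},
]
print(orders_with_latest_ship_rate(orders, ship_rates))
```

The dates here are ISO-formatted strings, so comparing them as text matches chronological order; with another date format they would need to be parsed first.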

Need help on oracle query

I have two Oracle tables: the first contains student info and the second contains student transaction details. I want a SQL query that reports the transaction details for each student, e.g. student ID, name, amount, transaction date, etc.
Note that a student can have many transactions. If the student with ID 1 bought 3 items, the query result should show student ID 1 once, along with the sum of the 3 items bought.
I don't want the student ID to be repeated 3 times, once for each item bought.
Thanks
EDIT:
Here's the query I have so far:
select
distinct(s.spriden_id),
s.spriden_last_name,
s.spriden_first_name,
t.tbraccd_detail_code,
t.sum(tbraccd_amount),
t.tbraccd_term_code,
t.tbraccd_user,
t.TBRACCD_DATE
from SPRIDEN s, TBRACCD t
where s.spriden_pidm = t.tbraccd_pidm
and t.tbraccd_term_code = 201320
and t.tbraccd_desc = 'Misc Book Store Charges';
(The first table is SPRIDEN while the second table is TBRACCD)
You can use GROUP BY to group students, as below:
select
    s.spriden_id,
    sum(t.tbraccd_amount)
from SPRIDEN s, TBRACCD t
where s.spriden_pidm = t.tbraccd_pidm
    and t.tbraccd_term_code = 201320
    and t.tbraccd_desc = 'Misc Book Store Charges'
GROUP BY s.spriden_id;
MODIFIED VERSION to select all columns:
select
s.spriden_id,
t.tbraccd_entry_date,
t.tbraccd_term_code,
t.tbraccd_user,
sum(t.tbraccd_amount)
from SPRIDEN s, TBRACCD t
where s.spriden_pidm = t.tbraccd_pidm
and t.tbraccd_term_code = 201320
and t.tbraccd_desc = 'Misc Book Store Charges'
GROUP BY
s.spriden_id,
t.tbraccd_entry_date,
t.tbraccd_term_code,
t.tbraccd_user;
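Every non-aggregated column in the SELECT list has to appear in the GROUP BY, so adding per-transaction columns such as the entry date yields one row per distinct combination rather than one row per student. The collapsing effect of grouping by the student alone can be tried with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: SQLite stands in for Oracle, and the student and txn tables with their columns are simplified, hypothetical versions of SPRIDEN and TBRACCD.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified stand-ins for SPRIDEN and TBRACCD.
    CREATE TABLE student (pidm INTEGER, student_id TEXT);
    CREATE TABLE txn (pidm INTEGER, amount REAL);
    INSERT INTO student VALUES (1, 'A001'), (2, 'A002');
    INSERT INTO txn VALUES (1, 10.0), (1, 5.5), (1, 4.5), (2, 7.0);
""")

# One row per student: the number of transactions and their total amount.
totals = conn.execute("""
    SELECT s.student_id, COUNT(*) AS items, SUM(t.amount) AS total
    FROM student s JOIN txn t ON s.pidm = t.pidm
    GROUP BY s.student_id
    ORDER BY s.student_id
""").fetchall()

for student_id, items, total in totals:
    print(student_id, items, total)
```

Student A001 has three transactions in the sample data but appears only once in the result, with the count and the sum alongside.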

Using group by subquery to deduplicate record set

I would like to know how I can return the activity ID to a parent query, using a query like the following, in order to de-duplicate the parent record set. The problem is that I cannot find a way to do this with GROUP BY or DISTINCT.
I will use all of the GROUP BY fields to determine a unique record, but I need to keep only the record that has min(status.effective_date).
The query returns the correct date values, but I cannot link them back to the parent's activity records with just that date value.
select min(status.effective_date)
from accounts
, address
, activity
, status
where accounts.par_row_id = activity.account_id
and address.row_id = activity.address_id
and status.par_row_id = activity.status_id
and accounts.name = 'xyz'
group by accounts.name, address.addr, address.ADDR_LINE_2, address.ADDR_LINE_3
, address.ADDR_LINE_4, address.CITY, address.COUNTRY, address.X_STATE
, address.ZIPCODE
You should try this query:
select min(activity.row_id) keep(dense_rank first order by status.effective_date) as activity_id
from accounts
, address
, activity
, status
where accounts.par_row_id = activity.account_id
and address.row_id = activity.address_id
and status.par_row_id = activity.status_id
and accounts.name = 'xyz'
group by accounts.name, address.addr, address.ADDR_LINE_2, address.ADDR_LINE_3
, address.addr_line_4, address.city, address.country, address.x_state
, address.ZIPCODE;
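The KEEP (DENSE_RANK FIRST ORDER BY ...) clause restricts the MIN to the rows that share the earliest effective_date within each group, so the query returns an activity ID instead of a date. A rough Python equivalent of that selection is sketched below. The function name and the row layout (group_key, activity_id, effective_date) are assumptions made for this illustration, and it assumes there are no null dates.

```python
from collections import defaultdict


def first_activity_per_group(rows):
    """rows: (group_key, activity_id, effective_date) -> {group_key: activity_id}."""
    groups = defaultdict(list)
    for key, activity_id, effective_date in rows:
        groups[key].append((effective_date, activity_id))
    # min over (date, id) tuples: the earliest date wins, and among rows tied
    # on that date the smallest activity id wins, like MIN(...) KEEP (... FIRST).
    return {key: min(pairs)[1] for key, pairs in groups.items()}


rows = [
    ("addr-1", 7, "2020-01-05"),
    ("addr-1", 9, "2020-01-01"),
    ("addr-1", 3, "2020-01-01"),
    ("addr-2", 4, "2021-02-02"),
]
print(first_activity_per_group(rows))
```

In the sample data, activities 9 and 3 tie on the earliest date for addr-1, and the smaller ID is the one kept.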

Aggregated information and projection in Pig Latin

I'm trying to apply a maximum aggregate function to a table by grouping on some fields and projecting. Can I refer to other non-grouping fields in the original table in the aggregating projection?
As an example, I have a table blah with schema (user_id: long, order_id: long, product_id: long, gender: chararray, size: int), where user_id, order_id and product_id form a composite key, but there can be multiple user ids and order ids. To get the maximum size for each order I use
result_table = foreach (group blah by (user_id, order_id)) generate
FLATTEN(group) as (user_id, order_id),
MAX(blah.size) as max_size;
Is there some way I can also add product_id to the creation of result_table so I have a table containing the user_id, order_id, product_id and max_size (max_size would be duplicated over differing product_ids) ?
If I could refer to the product_id specific to each grouped user_id and order_id I can save myself a mapreduce job by not joining back with the original table to access this field. Thanks guys.
Pig is well suited for such things: it has bags, which let it do things that in SQL require extra joins.
If you do the following:
grp = group blah by (user_id, order_id);
describe grp;
you will see that there is a bag with the schema identical to the schema of the "blah" (something like group:(user_id:long, order_id: long), blah: {(user_id: long, order_id: long, product_id: long, gender: chararray, size: int)}). That is a very powerful thing as it will allow us to create an output with all of the original rows with group summaries in each row without using inner joins:
grp = group blah by (user_id, order_id);
result_table = foreach grp generate
FLATTEN(blah.(user_id, order_id, product_id)), -- flatten the bag created by original group
MAX(blah.size) as max_size;
If the same product_id appears multiple times within a group of user_id, order_id, then there will be duplicates. To avoid that, we could use a DISTINCT nested inside the FOREACH:
grp = group blah by (user_id, order_id);
result_table = foreach grp {
    dist = distinct blah.(user_id, order_id, product_id); -- remove duplicates
    generate flatten(dist), MAX(blah.size) as max_size;
};
It will be done in a single MapReduce job.
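The idea of attaching a group-level maximum to each distinct row of the bag, without joining back to the original table, can be sketched in plain Python. This only illustrates the data flow and is not how Pig executes it; the function name and the tuple layout (user_id, order_id, product_id, gender, size) are assumptions taken from the schema in the question.

```python
from collections import defaultdict


def rows_with_group_max(rows):
    """rows: (user_id, order_id, product_id, gender, size) tuples."""
    # GROUP blah BY (user_id, order_id): each value is the "bag" for that key.
    bags = defaultdict(list)
    for row in rows:
        bags[(row[0], row[1])].append(row)

    result = []
    for bag in bags.values():
        max_size = max(r[4] for r in bag)  # MAX(blah.size)
        # dict.fromkeys removes duplicates while keeping first-seen order (DISTINCT).
        for key in dict.fromkeys((r[0], r[1], r[2]) for r in bag):
            result.append(key + (max_size,))  # FLATTEN the bag next to the max
    return result


rows = [
    (1, 10, 100, "F", 3),
    (1, 10, 101, "F", 5),
    (1, 10, 100, "F", 4),
    (2, 20, 200, "M", 2),
]
print(rows_with_group_max(rows))
```

Each distinct (user_id, order_id, product_id) appears once, and max_size is repeated across the products of the same order, as the question anticipated.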

how to filter a timestamp in Pig

I have a table with this schema:
(id: chararray, ts: long, data: chararray)
where ts is a timestamp stored as UNIX time.
Whenever a record is updated, its ts changes but its id does not. Both the old and the new versions of each record are stored in HDFS.
I just want to look at the latest data, so I wrote the Pig code like this:
grp = GROUP table BY id;
rst = FOREACH grp {
latest = FILTER table BY ts == MAX(table.ts);
GENERATE latest.id AS id,
latest.data AS data;
}
But this Pig code does not seem to work. Can anyone suggest how to make it work?
Have you tried order by ts in descending order?
LATEST = LIMIT (ORDER table BY ts desc) 1;
dump LATEST;
I'm not sure why that's not working, but it also wouldn't be too hard to write a UDF to do this. Simply input the bag of tuples, loop over them, and return the tuple with biggest timestamp. Then you could just do:
grp = GROUP table BY id;
latest = FOREACH grp GENERATE my.udfs.LatestInBag(table);
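The loop such a UDF would perform is short. The Python below sketches only the logic, assuming each tuple in the bag is laid out as (id, ts, data) as in the schema above. The names latest_in_bag and latest_per_id are made up for this illustration, and the code is not wired into Pig's UDF interface.

```python
from collections import defaultdict


def latest_in_bag(bag):
    """Return the tuple with the largest ts from a bag of (id, ts, data) tuples."""
    latest = None
    for record in bag:
        if latest is None or record[1] > latest[1]:
            latest = record
    return latest


def latest_per_id(records):
    """Group records by id, then keep only the newest record of each group."""
    bags = defaultdict(list)
    for record in records:
        bags[record[0]].append(record)
    return {key: latest_in_bag(bag) for key, bag in bags.items()}


records = [("a", 100, "old"), ("a", 200, "new"), ("b", 50, "only")]
print(latest_per_id(records))
```

When two records share the largest ts, this sketch keeps the first one encountered, because the comparison is strict.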
