Movie data set analysis using PIG


I have the following data set for a movie database:
Ratings: UserID, MovieID, Rating :: Movies: MovieID, Title :: Users: UserID, Gender, Age
Now I have to JOIN the above 3 datasets and determine which movie has the highest rating among females and lowest rating among males, and vice versa.
I have done the JOIN:
myusers = LOAD '/user/cloudera/movies/input/users.dat'
USING PigStorage(':')
AS (user:int, n1, gender:chararray, n2, age:int);
ratings = LOAD '/user/cloudera/movies/input/ratings.dat'
USING PigStorage(':')
AS (user:int, n1, movie:int, n2, rating:int);
movies = LOAD '/user/cloudera/movies/input/movies.dat'
USING PigStorage(':')
AS (movie:int,n1,title:chararray);
data = JOIN ratings BY user, myusers BY user;
data2= JOIN data BY ratings::movie, movies BY movie;
But after this I am running into many issues such as "ERROR 0: Scalar has more than one row in the output" when I try to print columns from data2. Any ideas to help me accomplish this task?

After the following step
data = JOIN ratings BY user, myusers BY user;
Create two datasets, one for male and another for female, using gender as the filter. Then order each dataset by rating and take the first row to get the max and min for both.
male = FILTER data by gender == 'M'; -- Use the gender value for male
female = FILTER data by gender == 'F';
m_max = LIMIT (ORDER male by rating DESC) 1;
f_max = LIMIT (ORDER female by rating DESC) 1;
m_min = LIMIT (ORDER male by rating ASC) 1;
f_min = LIMIT (ORDER female by rating ASC) 1;
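The ORDER/LIMIT statements above pick a single rating row. If "highest rating" is taken to mean the highest average rating per movie (an assumption; the question does not say), the same join, filter and aggregate steps can be sketched in plain Python. The tiny in-memory tables and the helper name `best_and_worst` are made up for illustration only.

```python
from collections import defaultdict

# Hypothetical miniature versions of the three data sets.
users = {1: "F", 2: "M", 3: "F"}                  # UserID -> Gender
movies = {10: "Alpha", 20: "Beta"}                # MovieID -> Title
ratings = [(1, 10, 5), (1, 20, 2), (2, 10, 1),    # (UserID, MovieID, Rating)
           (2, 20, 4), (3, 10, 4), (3, 20, 3)]


def best_and_worst(gender):
    """Return (highest, lowest) average-rated titles for one gender."""
    by_movie = defaultdict(list)
    for user_id, movie_id, rating in ratings:
        if users[user_id] == gender:              # join on user, filter on gender
            by_movie[movie_id].append(rating)     # group by movie
    averages = {movies[m]: sum(r) / len(r) for m, r in by_movie.items()}
    return max(averages, key=averages.get), min(averages, key=averages.get)


female_best, female_worst = best_and_worst("F")
male_best, male_worst = best_and_worst("M")
print(female_best, male_worst)
```

In Pig the equivalent would be a GROUP BY movie and gender followed by AVG, rather than ordering the raw rating rows.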

Related

oracle query using Union All

I need the female and male counts per ward and room type. My query is below:
(SELECT WARD,ROOMTYPE ,FEMALE,MALE
FROM
(SELECT NS.WARD,RC.ROOMTYPE,(DECODE(IP.GENDER,'F',COUNT(IP.HNO)))FEMALE,(DECODE(IP.GENDER,'M',COUNT(IP.HNO)))MALE FROM BEDSHIFT R,BED B,NURSTATION NS,PATIENTS IP,ROOMTYPE RT,ROOMCATEGORY RC
WHERE R.BD_CODE=B.BD_CODE AND B.NS_CODE=NS.NS_CODE AND R.IP_NO=IP.IP_NO AND R.RMC_OCCUPBY='B'
AND B.RT_CODE=RT.RT_CODE AND RT.RC_CODE=RC.RC_CODE
AND IP.IPC_STATUS IS NULL AND R.RMC_RELESETYPE IS NULL GROUP BY RC.ROOMTYPE,NS.WARD,IP.GENDER
UNION ALL
SELECT NS.WARD,RC.ROOMTYPE,(DECODE(IP.GENDER,'F',COUNT(IP.HNO)))FEMALE,(DECODE(IP.GENDER,'M',COUNT(IP.HNO)))MALE FROM PATIENTS IP,BED BD,NURSTATION NS,ROOMTYPE RT,ROOMCATEGORY RC
WHERE IP.BD_CODE=BD.BD_CODE
AND BD.RT_CODE=RT.RT_CODE AND RT.RC_CODE=RC.RC_CODE
AND BD.NS_CODE=NS.NS_CODE AND IP.IPC_STATUS IS NULL GROUP BY RC.ROOMTYPE,NS.WARD,IP.GENDER)
T
GROUP BY FEMALE,MALE,WARD,ROOMTYPE) ORDER BY WARD
returns below
need to get it as

Group only once and sum the males and females:
SELECT WARD, ROOMTYPE, SUM(FEMALE) AS FEMALE, SUM(MALE) AS MALE
FROM (SELECT NS.WARD,
RC.ROOMTYPE,
DECODE(IP.GENDER, 'F', NVL2(IP.HNO, 1, 0), 0) FEMALE, -- 1 per female row with an HNO
DECODE(IP.GENDER, 'M', NVL2(IP.HNO, 1, 0), 0) MALE -- 1 per male row with an HNO
FROM BEDSHIFT R,
BED B,
NURSTATION NS,
PATIENTS IP,
ROOMTYPE RT,
ROOMCATEGORY RC
WHERE R.BD_CODE = B.BD_CODE
AND B.NS_CODE = NS.NS_CODE
AND R.IP_NO = IP.IP_NO
AND R.RMC_OCCUPBY = 'B'
AND B.RT_CODE = RT.RT_CODE
AND RT.RC_CODE = RC.RC_CODE
AND IP.IPC_STATUS IS NULL
AND R.RMC_RELESETYPE IS NULL
UNION ALL
SELECT NS.WARD,
RC.ROOMTYPE,
DECODE(IP.GENDER, 'F', NVL2(IP.HNO, 1, 0), 0) FEMALE,
DECODE(IP.GENDER, 'M', NVL2(IP.HNO, 1, 0), 0) MALE
FROM PATIENTS IP,
BED BD,
NURSTATION NS,
ROOMTYPE RT,
ROOMCATEGORY RC
WHERE IP.BD_CODE = BD.BD_CODE
AND BD.RT_CODE = RT.RT_CODE
AND RT.RC_CODE = RC.RC_CODE
AND BD.NS_CODE = NS.NS_CODE
AND IP.IPC_STATUS IS NULL) T
GROUP BY WARD, ROOMTYPE
ORDER BY WARD
This way the query should be faster than grouping three or even four times. You could also use subqueries for males and females, which could be even faster without grouping at all (without seeing the schema, I can't give you that query).

Change the first line to
(SELECT WARD,ROOMTYPE ,SUM(FEMALE) AS FEMALE, SUM(MALE) AS MALE
and the last line to
GROUP BY WARD,ROOMTYPE) ORDER BY WARD,ROOMTYPE
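Both answers come down to one GROUP BY with a per-gender sum. A minimal runnable sketch of that idea follows, using Python's built-in sqlite3 module rather than Oracle. The `occupancy` table is hypothetical and stands in for the joined rows, and CASE replaces DECODE because SQLite has no DECODE function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for the already-joined patient rows.
conn.execute("CREATE TABLE occupancy (ward TEXT, roomtype TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO occupancy VALUES (?, ?, ?)",
    [("W1", "Single", "F"), ("W1", "Single", "M"),
     ("W1", "Single", "F"), ("W2", "Double", "M")],
)

# One GROUP BY; each row contributes 1 to exactly one of the two sums.
rows = conn.execute(
    """
    SELECT ward, roomtype,
           SUM(CASE WHEN gender = 'F' THEN 1 ELSE 0 END) AS female,
           SUM(CASE WHEN gender = 'M' THEN 1 ELSE 0 END) AS male
    FROM occupancy
    GROUP BY ward, roomtype
    ORDER BY ward, roomtype
    """
).fetchall()
print(rows)
```

Each ward and room type comes back as a single row with both counts side by side, instead of one row per gender.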

I am looking to achieve the below functionality in Pig. I have a set of sample records like this.
Note that the EffectiveDate column is sometimes blank and also different for the same CustomerID.
Now, as output, I want one record per CustomerID where the EffectiveDate is the MAX. So, for the above example, I want the records highlighted as shown below.
The way I am doing it currently using PIG is this:
customerdata = LOAD 'customerdata' AS (CustomerID:chararray, CustomerName:chararray, Age:int, Gender:chararray, EffectiveDate:chararray);
--Group customer data by CustomerID
customerdata_grpd = GROUP customerdata BY CustomerID;
--From the grouped data, generate one record per CustomerID that has the maximum EffectiveDate.
customerdata_maxdate = FOREACH customerdata_grpd GENERATE group as CustID, MAX(customerdata.EffectiveDate) as MaxDate;
--Join the above with the original data so that we get the other details like CustomerName, Age etc.
joinwithoriginal = JOIN customerdata by (CustomerID, EffectiveDate), customerdata_maxdate by (CustID, MaxDate);
finaloutput = FOREACH joinwithoriginal GENERATE customerdata::CustomerID as CustomerID, CustomerName as CustomerName, Age as Age, Gender as gender, EffectiveDate as EffectiveDate;
I am basically grouping the original data to find the record with the maximum EffectiveDate. Then I join these 'grouped' records with the Original dataset again to get that same record with Max Effective date, but this time I will also get additional data like CustomerName, Age and Gender. This dataset is huge, so this approach is taking a lot of time. Is there a better approach?

Input :
1,John,28,M,1-Jan-15
1,John,28,M,1-Feb-15
1,John,28,M,
1,John,28,M,1-Mar-14
2,Jane,25,F,5-Mar-14
2,Jane,25,F,5-Jun-15
2,Jane,25,F,3-Feb-14
Pig Script :
customer_data = LOAD 'customer_data.csv' USING PigStorage(',') AS (id:int,name:chararray,age:int,gender:chararray,effective_date:chararray);
customer_data_fmt = FOREACH customer_data GENERATE id..gender,ToDate(effective_date,'dd-MMM-yy') AS date, effective_date;
customer_data_grp_id = GROUP customer_data_fmt BY id;
req_data = FOREACH customer_data_grp_id {
    customer_data_ordered = ORDER customer_data_fmt BY date DESC;
    req_customer_data = LIMIT customer_data_ordered 1;
    GENERATE FLATTEN(req_customer_data.id) AS id,
             FLATTEN(req_customer_data.name) AS name,
             FLATTEN(req_customer_data.gender) AS gender,
             FLATTEN(req_customer_data.effective_date) AS effective_date;
};
Output :
(1,John,M,1-Feb-15)
(2,Jane,F,5-Jun-15)
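The point of the nested ORDER/LIMIT is that the latest record per customer is found in one pass over each group, with no join back to the original data. The same logic can be sketched in Python on the sample input. Treating a blank date as earlier than every real date is an assumption made for this sketch, and the helper name `parse` is made up.

```python
from datetime import datetime

rows = [
    (1, "John", 28, "M", "1-Jan-15"),
    (1, "John", 28, "M", "1-Feb-15"),
    (1, "John", 28, "M", ""),
    (1, "John", 28, "M", "1-Mar-14"),
    (2, "Jane", 25, "F", "5-Mar-14"),
    (2, "Jane", 25, "F", "5-Jun-15"),
    (2, "Jane", 25, "F", "3-Feb-14"),
]


def parse(date_text):
    # Assumption: a blank date sorts before every real date.
    if not date_text:
        return datetime.min
    return datetime.strptime(date_text, "%d-%b-%y")


# Single pass: keep the row with the latest date seen so far per customer.
latest = {}
for row in rows:
    customer_id = row[0]
    if customer_id not in latest or parse(row[4]) > parse(latest[customer_id][4]):
        latest[customer_id] = row

result = [(r[0], r[1], r[3], r[4]) for r in sorted(latest.values())]
print(result)
```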

Display only the largest count in a group by statement

I'm trying to display only the largest group in this group by statement:
SELECT COUNT(type) AS booking, type FROM booking b, room r WHERE r.rno = b.rno AND r.hno = b.hno GROUP BY type;
I modified it to get the response below, where you can see that the group double is larger than family.
BOOKING TYPE
5 double
2 family
I know there is a HAVING keyword that can restrict the output by comparing the count to a number, so I could add HAVING COUNT(type) > 2 or similar. But that's not very dynamic, and it only works in this instance because I know the two amounts.

ORDER BY COUNT(type) DESC LIMIT 1

There isn't a HAVING clause that does this. But you can use rownum with a subquery:
select t.*
from (SELECT COUNT(type) AS booking, type
FROM booking b join
room r
on r.rno = b.rno AND r.hno = b.hno
GROUP BY type
order by count(type) desc
) t
where rownum = 1;

Just order your query..
order by booking desc
regards

Try this (LIMIT is MySQL/PostgreSQL syntax; Oracle does not support it):
SELECT COUNT(type) AS booking, type FROM booking b, room r WHERE r.rno = b.rno AND r.hno = b.hno GROUP BY type ORDER BY COUNT(type) DESC LIMIT 1
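The "group, order by count descending, keep one row" pattern can be checked with a small runnable sketch. It uses Python's built-in sqlite3 module, which supports LIMIT; on Oracle the rownum subquery shown earlier plays the same role. The table contents are invented so that they reproduce the 5 double / 2 family counts from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up sample data: 5 bookings of a double room, 2 of a family room.
conn.executescript(
    """
    CREATE TABLE room (hno INTEGER, rno INTEGER, type TEXT);
    CREATE TABLE booking (hno INTEGER, rno INTEGER);
    INSERT INTO room VALUES (1, 1, 'double');
    INSERT INTO room VALUES (1, 2, 'family');
    INSERT INTO booking VALUES (1, 1);
    INSERT INTO booking VALUES (1, 1);
    INSERT INTO booking VALUES (1, 1);
    INSERT INTO booking VALUES (1, 1);
    INSERT INTO booking VALUES (1, 1);
    INSERT INTO booking VALUES (1, 2);
    INSERT INTO booking VALUES (1, 2);
    """
)

# Group, sort the groups by size, keep only the largest one.
top = conn.execute(
    """
    SELECT COUNT(*) AS bookings, r.type
    FROM booking b
    JOIN room r ON r.rno = b.rno AND r.hno = b.hno
    GROUP BY r.type
    ORDER BY COUNT(*) DESC
    LIMIT 1
    """
).fetchone()
print(top)
```

If two types tie for the largest count, this returns only one of them; which one is not defined without a second sort key.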

Linq query help

I'm attempting to write a linq query which uses several tables of related data and have gotten stuck.
The expected result: I need to return the three most populous metropolitan areas per region by population descending.
tables w/sample data:
MetroAreas -- ID, Name
2, Greater New York
Cities -- ID, Name, StateID
1293912, New York City, 10
CityPopulations -- ID, CityID, CensusYear, Population
20, 1293912, 2008, 123456789
21, 1293912, 2007, 123454321
MetroAreaCities -- ID, CityID, MetroAreaID
1, 1293912, 2
States -- ID, Name, RegionID
10, New York, 5
Regions -- ID, Name
5, Northeast
I start with the metro areas. Join the MetroAreaCities to get city IDs. Join Cities to get state IDs. Join States to get the region ID. Join regions so I can filter with a where. I get stuck when I try to include CityPopulations. I only want the three most populous metro areas for a given region. Doing a simple join on the cityPopulations returns a record per year.
(Here's what I have so far, this query was written for SubSonic 3):
return from p in GeoMetroArea.All()
join q in GeoMetroAreaCity.All() on p.ID equals q.MetroAreaID
join r in GeoCity.All() on q.CityID equals r.ID
join s in GeoState.All() on r.StateID equals s.ID
join t in GeoRegion.All() on s.RegionID equals t.ID
where t.ID == regionObjectPassedToMethod.ID
select p;
Can anyone help me with this query or point me in the right direction? Thank you very very much.

I haven't compiled it, but this should get you close:
var regionID = 5;
var year = (from c in GeoCityPopulation.All()
select c.CensusYear
).Max();
var metros =
// States in Region
from s in GeoState.All()
where s.RegionID == regionID
// Cities in State
join c in GeoCity.All() on s.ID equals c.StateID
// Metro Area for City
join mc in GeoMetroAreaCity.All() on c.ID equals mc.CityID
// Population for City
join cp in GeoCityPopulation.All() on c.ID equals cp.CityID
where cp.CensusYear == year
// Group the population values by Metro Area
group cp.Population by mc.MetroAreaID into g
select new
{
MetroID = g.Key, // Key = mc.MetroAreaID
Population = g.Sum() // g = seq. of Population values
} into mg
// Metro for MetroID
join m in GeoMetroArea.All() on mg.MetroID equals m.ID
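select new { m.Name, mg.Population };
// Keep only the three most populous metro areas, largest first
var top3 = metros.OrderByDescending(x => x.Population).Take(3);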
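To make the shape of the query easier to follow, here is the same sequence of steps (latest census year, cities in the region, population summed per metro area, top three) as a plain Python sketch. All rows, names and IDs below are invented sample data, not taken from the question.

```python
from collections import defaultdict

region_id = 5

# Invented sample rows mirroring the table layouts in the question.
states = [(10, "New York", 5), (11, "Ohio", 6)]             # ID, Name, RegionID
cities = [(1, "A", 10), (2, "B", 10), (3, "C", 10),
          (4, "D", 10), (5, "E", 10), (6, "F", 11)]         # ID, Name, StateID
metro_cities = [(1, 100), (2, 100), (3, 200),
                (4, 300), (5, 400), (6, 500)]               # CityID, MetroAreaID
metros = {100: "North", 200: "East", 300: "South",
          400: "West", 500: "Other"}                        # ID -> Name
populations = [(1, 2008, 50), (1, 2007, 45), (2, 2008, 30),
               (3, 2008, 70), (4, 2008, 20), (5, 2008, 10),
               (6, 2008, 999)]                              # CityID, CensusYear, Population

# Use only the most recent census year so each city counts once.
latest_year = max(year for _, year, _ in populations)
region_states = {sid for sid, _, rid in states if rid == region_id}
region_cities = {cid for cid, _, sid in cities if sid in region_states}

# Sum city populations per metro area.
totals = defaultdict(int)
for city_id, metro_id in metro_cities:
    if city_id in region_cities:
        totals[metro_id] += sum(
            pop for cid, year, pop in populations
            if cid == city_id and year == latest_year
        )

top3 = sorted(
    ((metros[m], pop) for m, pop in totals.items()),
    key=lambda pair: pair[1],
    reverse=True,
)[:3]
print(top3)
```

Filtering to a single census year before summing is what avoids the "one record per year" problem described in the question.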

how would you query this in linq?

Let's say I have 2 tables: products (just product ID and name) and sales (sale ID, product ID, amount, date).
Given a start date and an end date, I want to sum, for every product, its total sales amount in that time frame.
Note that some products will naturally have zero sales.
How should I write this query?

var products =
from p in mycontext.Products
select new
{
Product = p,
Sales = p.Sales
// sales has a single date column, so compare it against both bounds
.Where(s => s.date > startDate && s.date < endDate)
.Sum(s => s.amount)
};
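The key requirement is that products with no sales in the window still appear, with a total of zero. A minimal Python sketch of that behaviour follows: it starts every product at 0 and then adds the matching sales. The sample rows and the function name `totals_by_product` are made up, and treating both date bounds as inclusive is an assumption.

```python
from datetime import date

products = [(1, "Widget"), (2, "Gadget"), (3, "Gizmo")]   # product ID, name
sales = [                                                 # sale ID, product ID, amount, date
    (1, 1, 10.0, date(2024, 1, 5)),
    (2, 1, 5.0, date(2024, 2, 1)),
    (3, 2, 7.5, date(2024, 1, 20)),
    (4, 2, 100.0, date(2023, 12, 31)),
]


def totals_by_product(start, end):
    """Sum sales per product in [start, end]; products without sales get 0."""
    totals = {product_id: 0.0 for product_id, _ in products}
    for _, product_id, amount, sold_on in sales:
        if start <= sold_on <= end:        # assumption: bounds are inclusive
            totals[product_id] += amount
    return {name: totals[product_id] for product_id, name in products}


result = totals_by_product(date(2024, 1, 1), date(2024, 1, 31))
print(result)
```

Starting from the product list rather than from the sales rows is what keeps the zero-sales products in the result, which is also why the LINQ query selects from Products first.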
