Join and group-by in Hadoop Pig

I often see people using group-by and join for the same problem. Suppose I have a students table and a scores table, and I want to find each student's name together with the related course scores. It seems this can be solved with either a join or a group-by. What are the pros and cons of the two solutions? The data structure and code are below. Thanks.
table students:
student ID, student name, student email address
score table:
student ID, course ID, score
student_scores = group students by (studentId) inner, scores by (studentId);
student_scores = join students by studentId, scores by studentId;

The Pig Latin manual's section on JOIN says:
Note the following about the GROUP/COGROUP and JOIN operators:
The GROUP and JOIN operators perform similar functions. GROUP creates a nested set of output tuples while JOIN creates a flat set of output tuples.
The GROUP/COGROUP and JOIN operators handle null values differently (see Nulls and JOIN Operator).
I'm not sure whether these count as pros and cons, but the two operators are different.
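The nested-versus-flat distinction quoted from the manual can be illustrated outside Pig. What follows is a minimal Python sketch, not Pig itself: the sample rows and the helper names `cogroup` and `inner_join` are made up for illustration, and the sketch ignores the `inner` keyword and null handling.

```python
from collections import defaultdict

# Hypothetical sample data mirroring the two tables in the question.
students = [(1, "Ann", "ann@example.com"), (2, "Bob", "bob@example.com")]
scores = [(1, "math", 90), (1, "physics", 85), (2, "math", 70)]


def cogroup(left, right):
    """Mimic the nested shape of GROUP/COGROUP: one record per key, holding two bags."""
    bags = defaultdict(lambda: ([], []))
    for row in left:
        bags[row[0]][0].append(row)
    for row in right:
        bags[row[0]][1].append(row)
    return dict(bags)


def inner_join(left, right):
    """Mimic the flat shape of JOIN: one record per matching pair of rows."""
    return [l + r for l in left for r in right if l[0] == r[0]]


nested = cogroup(students, scores)
flat = inner_join(students, scores)

print(len(nested))  # one nested record per student ID
print(len(flat))    # one flat record per (student, score) pair
```

With the nested form, each student's scores stay together in one bag, which suits per-student aggregation. With the flat form, the student columns are repeated on every score row, which suits row-by-row output.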

Related

Display all the fields associated with the record using Impala

Suppose I have a student table with several fields in Impala. One of the fields is total_marks, and I need to find the details of the student with the maximum marks in each branch (department).
My table looks like this:
From this table I have to get the details of the student with the maximum marks in each department.
My query looks like this:
select id,max(total_marks) from student_details group by department;
With this query I can get only the id and total_marks. Since different students can share the same name or age, I can't group by fields like age or name.
So how should I query the table to get all the details of the top student in each department?
Thanks in advance.
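You can join the table against a subquery that computes the maximum per department. If two students tie for the top mark in a department, both are returned.
select stu.*
from student_details stu
join
  ( select department, max(total_marks) as max_marks
    from student_details
    group by department
  ) dept_max
  on stu.department = dept_max.department and stu.total_marks = dept_max.max_marks;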
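To see how joining against the per-department maximum behaves, including when two students tie, here is a small runnable sketch. It uses Python's built-in sqlite3 as a stand-in for Impala, which is an assumption. The `name` column, the sample rows, and the `dept_max`/`max_marks` aliases are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student_details (id INTEGER, name TEXT, department TEXT, total_marks INTEGER);
    INSERT INTO student_details VALUES
        (1, 'Ann', 'CS', 480),
        (2, 'Bob', 'CS', 455),
        (3, 'Cid', 'EE', 430),
        (4, 'Dee', 'EE', 430);
""")

# Join every student row to the maximum mark of its own department.
rows = conn.execute("""
    SELECT stu.*
    FROM student_details stu
    JOIN (
        SELECT department, MAX(total_marks) AS max_marks
        FROM student_details
        GROUP BY department
    ) dept_max
      ON stu.department = dept_max.department
     AND stu.total_marks = dept_max.max_marks
    ORDER BY stu.id
""").fetchall()

for row in rows:
    print(row)
```

In this sample the EE department has two students with the same top mark, so both rows come back. If exactly one row per department is required, a tie-breaking rule has to be added.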

LINQ Join and performing aggregate functions

I am having trouble writing a LINQ query that joins three tables and then applies aggregate functions to the rows. Any help would be appreciated.
I have three tables:
Table 1: Students (Id, Name)
Table 2: Subject (SubID, Title, Id)
Table 3: Grade (Id, SubID, marks)
I need a LINQ query that returns the following:
Count of rows in the Students table
Count of rows in the Grade table
Sum of marks over all rows in the Grade table
This is the query I have so far, but I don't think it is correct:
var _Count = from student in _context.Students
join subject in _context.Subject on student.Id equals subject.Id
join grade in _context.Grade on subject.SubID equals grade.SubID
// How to group them?
select new { //How to take and return the counts?};

Comparing values from two different tables in SQL Plus Oracle

I have several tables, some of which have foreign keys. Here are the tables:
Doctor
Doctor_id, FirstName, SecondName,etc...
Hospital
Hospital_id, Name...
Job
Job_id
fk Doctor_id
fk Hospital_id
I'm trying to show a list of doctors who work in hospital 'X'. How would I write this query?
SELECT FirstName, SecondName
FROM Doctor, Job, Hospital
WHERE Hospital.Name = 'HospitalName' AND Job.hospital_id = Hospital.hospital_id;
I'm not sure whether that query is right, because it shows every single doctor (not just the ones who work in 'HospitalName'). If the query is correct, then I guess the foreign keys aren't right?
Thanks in advance. DG
You should learn to use proper join syntax. Then mistakes like this are much less likely to occur:
SELECT d.FirstName, d.SecondName
FROM Doctor d
JOIN Job j
  ON d.Doctor_id = j.Doctor_id
JOIN Hospital h
  ON j.Hospital_id = h.Hospital_id
WHERE h.Name = 'HospitalName';
This also adds table aliases to every column, so anyone reading the query knows which table each column comes from.
You are missing one join condition, the one that ties Job to Doctor:
SELECT FirstName, SecondName
FROM Doctor, Job, Hospital
WHERE Hospital.Name = 'HospitalName'
AND Job.hospital_id = Hospital.hospital_id
AND job.Doctor_id = Doctor.doctor_id;
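The effect of the missing join condition can be reproduced on a tiny data set. The sketch below uses Python's built-in sqlite3 rather than Oracle, which is an assumption, and the sample doctors, hospitals, and jobs are invented. It runs the original query and the explicit-join version side by side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Doctor (Doctor_id INTEGER PRIMARY KEY, FirstName TEXT, SecondName TEXT);
    CREATE TABLE Hospital (Hospital_id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Job (Job_id INTEGER PRIMARY KEY, Doctor_id INTEGER, Hospital_id INTEGER);
    INSERT INTO Doctor VALUES (1, 'Ada', 'Stone'), (2, 'Ben', 'Price'), (3, 'Cara', 'Wells');
    INSERT INTO Hospital VALUES (10, 'HospitalName'), (20, 'OtherHospital');
    INSERT INTO Job VALUES (100, 1, 10), (101, 2, 20), (102, 3, 10);
""")

# Original query: Doctor is never tied to Job, so every doctor is paired
# with every job at the matching hospital.
missing_condition = conn.execute("""
    SELECT FirstName, SecondName
    FROM Doctor, Job, Hospital
    WHERE Hospital.Name = 'HospitalName'
      AND Job.Hospital_id = Hospital.Hospital_id
""").fetchall()

# Explicit joins: each table is tied to the next one by its key.
fixed = conn.execute("""
    SELECT d.FirstName, d.SecondName
    FROM Doctor d
    JOIN Job j ON d.Doctor_id = j.Doctor_id
    JOIN Hospital h ON j.Hospital_id = h.Hospital_id
    WHERE h.Name = 'HospitalName'
    ORDER BY d.Doctor_id
""").fetchall()

print(len(missing_condition))  # 3 doctors x 2 matching jobs
print(fixed)
```

The first result lists all three doctors, each repeated once per job at the hospital, which matches the "shows every single doctor" symptom in the question. The second lists only the two doctors who have a job there.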

Linq query for one to many

My question involves MVC and a LINQ query. I will keep it simple and skip the details of the model, view, and so on. Say I have two tables, T1 and T2. T1 holds restaurant details and T2 holds restaurant image paths; each T2 row contains a restaurantID. If T2 has more than one image path row for a restaurant and I only need the first one, how would I write the query? I have simplified the question: the real query joins six tables related to the restaurants. I created a view model that contains only the fields I want to display, and I populate it from the query in the controller.
When I join T2 in the query, I get all the restaurant details together with the images, but the view repeats the same restaurant once for every matching row in T2, which is not what I want. The problem comes from how I wrote the query, which uses joins. I need only the first row from T2 alongside all the restaurant details. I have not found an example of this on the web so far. Any pointers would be much appreciated.
Serhat Albayoglu
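In your join you can use into to collect the matching T2 rows into a group, and then take FirstOrDefault from that group in the select:
var query = from t in context.T1
            join t2 in context.T2 on t.Id equals t2.RestaurantID into tgroup
            select new
            {
                Restaurant = t,
                // t2 is out of scope after "into", so read the first path from the group
                ImagePath = tgroup.Select(i => i.path).FirstOrDefault()
            };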
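The idea behind the group join (collect the "many" side per key, then take the first element or a default) can be sketched independently of LINQ. The following Python version is only an illustration of those semantics; the sample restaurants, image paths, and the `view_rows` name are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows standing in for T1 (restaurants) and T2 (image paths).
restaurants = [{"Id": 1, "Name": "Luna"}, {"Id": 2, "Name": "Sol"}, {"Id": 3, "Name": "Mar"}]
images = [
    {"RestaurantID": 1, "path": "/img/luna-1.jpg"},
    {"RestaurantID": 1, "path": "/img/luna-2.jpg"},
    {"RestaurantID": 2, "path": "/img/sol-1.jpg"},
]

# Group the "many" side by its foreign key, preserving row order.
images_by_restaurant = defaultdict(list)
for image in images:
    images_by_restaurant[image["RestaurantID"]].append(image["path"])

# One output row per restaurant; None plays the role of FirstOrDefault's default.
view_rows = [
    {
        "Name": r["Name"],
        "ImagePath": next(iter(images_by_restaurant.get(r["Id"], [])), None),
    }
    for r in restaurants
]

for row in view_rows:
    print(row)
```

Each restaurant appears exactly once regardless of how many image rows it has, and a restaurant without images still appears with an empty image path.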

Active Record Join with most recent association object attribute

I have a Contact model which has many Notes. On one page of my app, I show a table of contacts with several attributes (name, email, latest note created_at).
For the note column, I'm trying to write a joins statement that grabs all contacts along with just their latest note (or even just its created_at).
What I've come up with is incorrect as it limits and orders the contacts, not their notes:
current_user.contacts.joins(:notes).limit(1).order('created_at DESC')
If you just want the created_at value for the most recent note for each contact, you can first create a query to find the max value and then join with that query:
max_times = Note.group(:contact_id).select("contact_id, MAX(created_at) AS note_created_at").to_sql
current_user.contacts.select("contacts.*, note_created_at").joins("LEFT JOIN (#{max_times}) max_times ON contacts.id = max_times.contact_id")
If you want to work with the Note object for the most recent notes, one option would be to select the notes and group them by the contact_id. Then you can read them out of the hash as you work with each Contact.
max_times = Note.group(:contact_id).select("contact_id, MAX(created_at) AS note_created_at").to_sql
max_notes = Note.select("DISTINCT ON (notes.contact_id) notes.*").joins("INNER JOIN (#{max_times}) max_times ON notes.contact_id = max_times.contact_id AND notes.created_at = note_created_at").where(contact_id: current_user.contact_ids)
max_notes.group_by(&:contact_id)
This uses DISTINCT ON to drop duplicates in case two notes have exactly the same contact_id and created_at values. DISTINCT ON is specific to PostgreSQL, so on another database you'll need a different way to deal with duplicates.
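The first approach (join contacts against a per-contact MAX(created_at) subquery) can be tried on sample data. This sketch runs the equivalent SQL through Python's built-in sqlite3 instead of Active Record, which is an assumption. The table contents are invented, and the scoping to current_user is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE notes (id INTEGER PRIMARY KEY, contact_id INTEGER, body TEXT, created_at TEXT);
    INSERT INTO contacts VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
    INSERT INTO notes VALUES
        (1, 1, 'first call', '2024-01-05 09:00:00'),
        (2, 1, 'follow-up',  '2024-02-10 14:30:00'),
        (3, 2, 'intro',      '2024-01-20 11:15:00');
""")

# LEFT JOIN keeps contacts that have no notes; their note_created_at is NULL.
rows = conn.execute("""
    SELECT contacts.id, contacts.name, max_times.note_created_at
    FROM contacts
    LEFT JOIN (
        SELECT contact_id, MAX(created_at) AS note_created_at
        FROM notes
        GROUP BY contact_id
    ) max_times ON contacts.id = max_times.contact_id
    ORDER BY contacts.id
""").fetchall()

for row in rows:
    print(row)
```

Every contact appears once. Contacts with several notes show only the latest timestamp, and a contact with no notes still appears, which is why the answer uses a LEFT JOIN rather than an inner join.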
