Pig - counting members across a group - hadoop

Say I have a relation Students, with fields grade and teacher. I want to group by both grade and teacher, but retain a count of all the students per grade in each group. Something like:
classes = GROUP Students BY (grade,teacher);
classes = FOREACH classes {
    GENERATE
        (### COUNT OF ALL STUDENTS IN GRADE ###) as grade_size,
        Students as students,
        teacher as teacher;
}
But I can't figure out how to do the filter from inside the group statement. I need some kind of filter, but I don't know how to scope the grade of the students outside vs. inside the group.

There are 2 ways of doing it:
1) Group by grade and teacher, then count, then flatten and group by grade.
classes = GROUP Students BY (grade,teacher);
teachers = FOREACH classes GENERATE FLATTEN(group) as (grade,teacher), COUNT(Students) as perTeacher;
grade = GROUP teachers BY grade;
result = FOREACH grade GENERATE FLATTEN(teachers), SUM(teachers.perTeacher) as perGrade;
describe result;
dump result;
2) Group by grade, then use the BagGroup UDF from the DataFu library to do the group-by in memory. This is faster, but it is vulnerable to possible heap memory exceptions.
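The first approach above (count per grade and teacher, then sum those counts per grade) can be sketched outside Pig as well. The following is only an illustration in Python, assuming a small made-up list of student rows; the names students, per_teacher, per_grade and result are hypothetical and not part of the original script.

```python
from collections import Counter

# Hypothetical sample data: (student, grade, teacher) rows standing in for Students.
students = [
    ("ann", 1, "smith"),
    ("bob", 1, "smith"),
    ("cat", 1, "jones"),
    ("dan", 2, "lee"),
]

# Step 1: count students per (grade, teacher), like the first GROUP plus COUNT.
per_teacher = Counter((grade, teacher) for _, grade, teacher in students)

# Step 2: sum those counts per grade, like the second GROUP plus SUM.
per_grade = Counter()
for (grade, _teacher), n in per_teacher.items():
    per_grade[grade] += n

# Attach the grade total to every (grade, teacher) row.
result = sorted(
    (grade, teacher, n, per_grade[grade])
    for (grade, teacher), n in per_teacher.items()
)
for row in result:
    print(row)
```

Each output row carries the per-teacher count and the per-grade total side by side, which is the shape the question asks for.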

Related

How to find total count of distinct items along with item using Room?

Let's say I have a table kids_data
id  kid_name  father_name
1   Alisa     Bosconovitch
2   Ebby      Bosconovitch
3   John      Peter
4   simmy     Alladin
5   sara      Alladin
Now I want to fetch distinct names and their count too.
So if I perform
SELECT DISTINCT father_name from kids_data
What I will get is
father_name
Bosconovitch
Peter
Alladin
I want the count of kids for each father too.
Since Bosconovitch and Alladin each have two children, I want that number as well.
How can I do this using Room Database in Android?
My class is
data class FatherData(
val kid_count: Int,
val father_name: String
)
In Dao
// What query can I use to achieve this?
@Query("SELECT DISTINCT father_name from kids_data")
fun getFatherData(): List<FatherData>
This does what I wanted!
This query will be used.
@Query("select father_name, count(father_name) as kid_count from kids_data group by father_name")
fun getFatherData(): List<FatherData>
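Since Room sits on top of SQLite, the GROUP BY query can be checked directly against the sample table. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the ORDER BY clause is an addition made here only to get a stable row order and is not part of the original query.

```python
import sqlite3

# In-memory database with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kids_data (id INTEGER PRIMARY KEY, kid_name TEXT, father_name TEXT)"
)
conn.executemany(
    "INSERT INTO kids_data VALUES (?, ?, ?)",
    [
        (1, "Alisa", "Bosconovitch"),
        (2, "Ebby", "Bosconovitch"),
        (3, "John", "Peter"),
        (4, "simmy", "Alladin"),
        (5, "sara", "Alladin"),
    ],
)

# Same GROUP BY as the Room query; ORDER BY is added only for a stable order.
rows = conn.execute(
    "SELECT father_name, COUNT(father_name) AS kid_count "
    "FROM kids_data GROUP BY father_name ORDER BY father_name"
).fetchall()
print(rows)
```

The column aliases father_name and kid_count match the field names of FatherData, which is what lets Room map each row onto the data class.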

Which way gives higher performance when selecting a lot of data: IQueryable vs. for loop (using Entity Framework)

I am trying to get a list from the database that contains two or more lists inside it (using .NET Core and Entity Framework). Assume I have two tables called the Header table and the Detail table.
Header Table
Detail Table
And I want the result like this:
{
  "data": [
    {
      "Country": "Singapore",
      "Hospital_List": [
        {
          "Hospital_Name": "SG Host A"
        },
        {
          "Hospital_Name": "SG Host A"
        }
      ]
    },
    {
    }
  ]
}
I only know two ways to get a result like this. The first way is to select the Country list with a blank Hospital list as a List, then loop over that list and select the related Hospital list from the database again for each country.
The second way is to select the Country list with a blank Hospital list as an IQueryable, and then select the related Hospital list by joining with the Hospital table. So my question is:
Which way should I use to get higher performance? And is there any other way?
Please keep in mind that my real tables have a lot of fields and data.
The for loop will give you the lowest performance, because it issues a separate SQL query on each iteration. Instead, try the following solution:
from hospital in hospitals
group hospital by hospital.CID into gh
join country in countries
    on gh.Key equals country.CID    // gh.Key is the CID the group was built on
select new
{
    Country = country.Country,
    Hospital_List = from h in gh select h
}
EDITED:
And if your model is set up correctly, you can use this code:
from hospital in hospitals
join country in countries
    on hospital.Country equals country
group hospital by hospital.CID into gh
select new
{
    Country = from h in gh select h.Country.Country,
    Hospital_List = from h in gh select h
}
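To illustrate why a single joined query tends to beat a per-country loop, here is a rough sketch in Python with the built-in sqlite3 module rather than Entity Framework. The table names countries and hospitals, the HID column, and the sample rows ('Malaysia', 'SG Host B', 'MY Host A') are assumptions made up for this example; only CID, Country and Hospital_Name come from the question.

```python
import sqlite3
from collections import defaultdict

# Hypothetical schema and sample rows standing in for the Header/Detail tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (CID INTEGER PRIMARY KEY, Country TEXT);
    CREATE TABLE hospitals (HID INTEGER PRIMARY KEY, CID INTEGER, Hospital_Name TEXT);
    INSERT INTO countries VALUES (1, 'Singapore'), (2, 'Malaysia');
    INSERT INTO hospitals VALUES
        (1, 1, 'SG Host A'), (2, 1, 'SG Host B'), (3, 2, 'MY Host A');
""")

# One round trip to the database: join once, then group the rows in memory.
rows = conn.execute(
    "SELECT c.Country, h.Hospital_Name "
    "FROM countries c JOIN hospitals h ON h.CID = c.CID "
    "ORDER BY c.CID, h.HID"
).fetchall()

grouped = defaultdict(list)
for country, hospital_name in rows:
    grouped[country].append({"Hospital_Name": hospital_name})

# Shape the result like the JSON in the question.
data = [{"Country": c, "Hospital_List": hs} for c, hs in grouped.items()]
print(data)
```

The loop-based alternative would run one extra query per country, so the number of round trips grows with the number of countries, while this version always issues exactly one query.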

Relational algebra for one-to-many relations

Suppose I have the following relations:
Academic(academicID (PK), forename, surname, room)
Contact (contactID (PK), forename, surname, phone, academicNO (FK))
I am using Java & I want to understand the use of the notation.
Π( relation, attr1, ... attrn ) means project the n attributes out of the relation.
σ( relation, condition) means select the rows which match the condition.
⊗(relation1,attr1,relation2,attr2) means join the two relations on the named attributes.
relation1 – relation2 is the difference between two relations.
relation1 ÷ relation2 divides one relation by another.
Examples I have seen use three tables. I want to know the logic when only two tables are involved (academic and contact) as opposed to three (academic, contact, owns).
I am using this structure:
LessNumVac = Π( σ( job, vacancies < 2 ), type )
AllTypes = Π( job, type )
AllTypes – LessNumVac
How do I construct the algebra for:
List the names of all contacts owned by academic "John"
List the names of all contacts who is owned by academic "John".
For that, you would join the Academic and Contact relations, filter for John, and project the name attributes. For efficiency, select John before joining:
π_{forename, surname}(Contact ⋈_{academicNO = academicID} (π_{academicID}(σ_{forename = "John"}(Academic))))
You have to extend your set of operations with natural join ⋈, left outer join ⟕, and/or right outer join ⟖ to express joins.
There is a great Wikipedia article about Relational Algebra. You should definitely read that one!
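As a rough way to see what the expression above computes, the three operators can be mimicked over lists of dictionaries. This is only a sketch in Python: the helper names select, project and join, and all sample rows, are made up for illustration.

```python
# Hypothetical sample rows for the two relations in the question.
academic = [
    {"academicID": 1, "forename": "John", "surname": "Doe", "room": "A1"},
    {"academicID": 2, "forename": "Mary", "surname": "Roe", "room": "B2"},
]
contact = [
    {"contactID": 10, "forename": "Ann", "surname": "Lee", "phone": "111", "academicNO": 1},
    {"contactID": 11, "forename": "Bob", "surname": "Kim", "phone": "222", "academicNO": 2},
    {"contactID": 12, "forename": "Cy", "surname": "Ng", "phone": "333", "academicNO": 1},
]

def select(relation, predicate):
    """Sigma: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, *attrs):
    """Pi: keep only the named attributes and drop duplicate rows."""
    seen, out = set(), []
    for row in relation:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out

def join(left, right, left_attr, right_attr):
    """Join: pair rows whose named attributes are equal."""
    return [{**l, **r} for l in left for r in right if l[left_attr] == r[right_attr]]

# Select John first, keep only his ID, then join and project the names.
johns = project(select(academic, lambda r: r["forename"] == "John"), "academicID")
names = project(join(contact, johns, "academicNO", "academicID"), "forename", "surname")
print(names)
```

Projecting academicID before the join keeps the Academic forename and surname out of the joined rows, so they cannot clash with the Contact attributes of the same name.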

Efficient algorithm that takes a Twitter user and finds top users by order of how many of his followers they follow

The title is very wordy. So I'll explain with an example.
We have a database of 10,000 Twitter users, each following up to 2,000 users. The algorithm takes as input one random, never-before-seen user (including the people who follow him) and returns the Twitter users from the database, ordered by how many of his followers they follow.
i.e.
We have:
User A follows 1,2,3,4
User B follows 3,4,5,6
User C follows 4,8,9
We enter user X who has users 3,4,5 following him.
The algorithm should return:
B: 3 matches (3,4,5)
A: 2 matches (3,4)
C: 1 match (4)
Store the data as a sparse integer matrix A of size 10^5x10^5 with ones at the appropriate places. Then, given a user i, compute A[i,] * A (matrix multiplication). Then sort.
Assuming you have a table structure similar to this:
Table Users
Id (PK, uniqueidentifier, not null)
Username (nvarchar(50), not null)
Table UserFollowers
UserId (FK, uniqueidentifier, not null)
FollowerId (uniqueidentifier, not null)
You can use the following query to find the users who follow the followers of the user in the query, together with how many of those followers each one follows:
SELECT Users_Inner.Username, COUNT(Users_Inner.Id) AS [Total Common Parents]
FROM Users INNER JOIN
UserFollowers ON Users.Id = UserFollowers.FollowerId INNER JOIN
UserFollowers AS UserFollowers_Inner ON UserFollowers.FollowerId = UserFollowers_Inner.UserId INNER JOIN
Users AS Users_Inner ON UserFollowers_Inner.FollowerId = Users_Inner.Id
WHERE (UserFollowers.UserId = 'BD34A1FF-FCF5-4D35-B8A3-EFFB1587A874')
GROUP BY Users_Inner.Username
ORDER BY COUNT(Users_Inner.Id) DESC
Would something like this work?
for f in followers(x)
    for ff in followers(f)
        count[ff]++ // assume it is initially 0
sort the ff-s by their counts
Unlike the matrix solution, the complexity of this is proportional to the number of people involved rather than the number of users on Twitter.
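The counting loop above can be tried on the example from the question. The sketch below is in Python and assumes the database is held in memory as a dictionary; the names follows, followers_of and x_followers are hypothetical.

```python
from collections import Counter

# follows[u] = set of accounts that u follows (the example data from the question).
follows = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5, 6},
    "C": {4, 8, 9},
}
# The users who follow the new user X.
x_followers = {3, 4, 5}

# Inverted index: followers_of[v] = database users who follow v.
followers_of = {}
for user, followed in follows.items():
    for v in followed:
        followers_of.setdefault(v, set()).add(user)

# For each follower f of X, credit every database user who follows f.
counts = Counter()
for f in x_followers:
    for ff in followers_of.get(f, ()):
        counts[ff] += 1

# Sort by match count, highest first.
ranking = counts.most_common()
print(ranking)
```

The inverted index is built once for the whole database, so each new user X costs only a pass over X's followers and the people who follow them.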

Pig: Pulling individual fields out after a GROUP

In PigLatin, I want to pull the other fields out of a record I want to select because of an aggregate, such as MAX.
I'm having trouble explaining the problem, so here is an example. Let's say I want to grab the name of the oldest person at a household:
Relation A is four columns, (name, address, zipcode, age)
B = GROUP A BY (address, zipcode); -- group by the address
-- generate the address, the person's age, but how do I grab that person's name?
C = FOREACH B GENERATE FLATTEN(group), MAX(A.age), ??? Name ???;
How do I generate the name of the person with the MAX age?
The problem with your logic is that there can be more than one person with the MAX(age). Then you have to GROUP BY (name, address, age). But to give you a quick answer, I will write one that gets only one of the people with the max age. (I am not sure it is the optimal way, though.)
C = FOREACH B {
    DA = ORDER A BY age DESC;
    DB = LIMIT DA 1;
    GENERATE FLATTEN(group), FLATTEN(DB.age), FLATTEN(DB.name);
}
Be careful with frail's accepted answer, as it would have undesirable behavior if the number in the LIMIT command is higher than 1. In that case, the output would be a cross-product of all ages and names because of the last two FLATTEN calls, so if the value in the LIMIT is N, there would be N^2 output rows instead of the intended N.
It is much safer to do the following in the GENERATE line, which gives exactly the same result as the accepted answer when 'LIMIT 1' is used:
GENERATE FLATTEN(group) AS (address, zipcode), FLATTEN(DB.(age, name)) AS (age, name);
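The idea of keeping the age and name together as one record per household, rather than flattening them separately, can be sketched outside Pig as well. This Python version is only an illustration with made-up rows; the names A, oldest and C mirror the relations in the question but the data is hypothetical.

```python
# Hypothetical rows in the shape of relation A: (name, address, zipcode, age).
A = [
    ("amy", "1 Main St", "10001", 34),
    ("bo", "1 Main St", "10001", 61),
    ("cy", "2 Oak Ave", "20002", 45),
]

# Keep one (age, name) pair per (address, zipcode) group.
oldest = {}
for name, address, zipcode, age in A:
    key = (address, zipcode)
    # Replace only on a strictly larger age, so a tie keeps the first row seen
    # and each household still yields exactly one row.
    if key not in oldest or age > oldest[key][0]:
        oldest[key] = (age, name)

# One output row per household: (address, zipcode, age, name).
C = sorted(
    (address, zipcode, age, name)
    for (address, zipcode), (age, name) in oldest.items()
)
for row in C:
    print(row)
```

Because age and name travel as a single pair, there is no way to end up with a cross-product of ages and names, which is the same reason FLATTEN(DB.(age, name)) is safer than two separate FLATTEN calls.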

Resources