How to find similar tuples after grouping using Pig Latin? - hadoop

I have data with the following structure:
1 John US
2 Mary CN
3 Smith US
4 John US
5 Mary CN
I need to find duplicate names within each country. The result should look something like this:
{US : (1, John, US),(4, John, US)}
{CN : (2, Mary, CN),(5, Mary, CN)}
I'm able to load the data and group it by country name. Could someone help me with a Pig script for this problem?

I assume you have the input in the following format:
1 John US
2 Mary CN
3 Smith US
4 John US
5 Mary CN
In that case you can use the following:
A = load 'data.txt' using PigStorage(' ')
as (id:int, name:chararray, country:chararray);
B = foreach (group A by (country, name)) generate group.country, A,
COUNT(A) as count;
C = foreach (FILTER B by count > (long)1) generate country, A;
dump C;
(CN,{(2,Mary,CN),(5,Mary,CN)})
(US,{(1,John,US),(4,John,US)})
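For readers who want to check the grouping logic outside Pig, here is a minimal Python sketch of the same idea: group by (country, name), then keep only the groups with more than one member. The function name `duplicate_names_by_country` is hypothetical and not part of the original answer.

```python
from collections import defaultdict

def duplicate_names_by_country(rows):
    """Return {country: [rows]} containing only names that occur more than once per country."""
    # Group rows by the composite key (country, name), like `group A by (country, name)`.
    groups = defaultdict(list)
    for row in rows:
        _id, name, country = row
        groups[(country, name)].append(row)

    # Keep only groups with more than one member, like `FILTER B by count > 1`.
    result = defaultdict(list)
    for (country, _name), members in groups.items():
        if len(members) > 1:
            result[country].extend(members)
    return dict(result)

rows = [
    (1, "John", "US"),
    (2, "Mary", "CN"),
    (3, "Smith", "US"),
    (4, "John", "US"),
    (5, "Mary", "CN"),
]
duplicates = duplicate_names_by_country(rows)
print(duplicates)
```

Unlike the Pig script, which emits one output row per (country, name) pair, this sketch merges all duplicated names of a country into a single list.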

Related

data.table create table from rows

I would like to analyze a table that reports job codes used by people over the course of several pay periods: I want to know how many times each person has used each job code.
The table lists people in the first column, and pay periods in subsequent columns -- I cannot transpose without creating new problems with names.
The table looks like this:
people  pp1  pp2  pp3  pp4
Bob     A    A    A    C
Ted     B    B    B    B
Alice   B    A    C    C
My desired output looks like this:
people  A  B  C
Bob     3  0  1
Ted     0  4  0
Alice   1  1  2
My code is as follows:
myDT <- data.table(
people = c('Bob','Ted','Alice'),
pp1 = c('A','B','B'),
pp2 = c('A','B','A'),
pp3 = c('A','B','C'),
pp4 = c('C','B','C')
)
id.col=paste('pp',1:3)
myDT[ , table(as.matrix(.SD)), .SDcols = id.col, by = 1:nrow(myDT)]
but it's nowhere close to working.
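# Reshape to long format (one row per person and pay period),
# then count how often each job code appears per person.
melt(myDT, "people") |>
  dcast(people ~ value, fun.aggregate = length)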
# people A B C
# <char> <int> <int> <int>
# 1: Alice 1 1 2
# 2: Bob 3 0 1
# 3: Ted 0 4 0
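The same count-per-person idea can be sketched in plain Python, which may help to see what the melt/dcast pipeline computes. This is only an illustration; the function name `count_codes` and the dictionary layout are assumptions, not part of the answer.

```python
from collections import Counter

def count_codes(table):
    """table maps person -> list of job codes; returns (codes, {person: {code: count}})."""
    # Collect every distinct job code so that missing codes show up as 0.
    codes = sorted({code for row in table.values() for code in row})
    counts = {}
    for person, row in table.items():
        tally = Counter(row)  # a Counter returns 0 for codes the person never used
        counts[person] = {code: tally[code] for code in codes}
    return codes, counts

table = {
    "Bob": ["A", "A", "A", "C"],
    "Ted": ["B", "B", "B", "B"],
    "Alice": ["B", "A", "C", "C"],
}
codes, counts = count_codes(table)
for person, row in counts.items():
    print(person, [row[code] for code in codes])
```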

A formula to get the first or last instance of a specific criteria against a list of names

I have a list of multiple instances of the same worker with multiple instances of work and rest.
I need the name of the worker and time of the first instance of Work and the time of the last instance of Rest.
Per person I can get the first work time if I type the actual range by this formula
=index(Filter(C2:D9,C2:C9="work"),1,2) This is Col G
I can determine how many entries of Rest there are for the same person to then work out the time of the last rest
with =index(Filter(C2:D9,C2:C9="rest"),E2,2). This is Col H
And I can get a unique list of names with this formula
=sort(unique(A2:A)) this is Col F
But I can't work out the formula to look at Col F against Cols A to D then return First Work Time in Col G
And a formula to look at Col F (against Cols A to D) and return the Last Rest Time in Col H
My data is in columns A to D and the results I would like are shown in Cols F to H
    A             B           C       D      E  F             G           H
1   Name          Date        Action  Time      Name          First Work  Last Rest
2   Smith, Fred   14-03-2022  rest    00:00  4  Smith, Fred   06:03       07:08
3   Smith, Fred   14-03-2022  work    06:03  2  Jones, Harry  07:48       08:08
4   Smith, Fred   14-03-2022  rest    06:05
5   Smith, Fred   14-03-2022  work    06:06
6   Smith, Fred   14-03-2022  drive   06:15
7   Smith, Fred   14-03-2022  rest    06:59
8   Smith, Fred   14-03-2022  drive   07:02
9   Smith, Fred   14-03-2022  rest    07:08
10  Jones, Harry  14-03-2022  rest    00:00
11  Jones, Harry  14-03-2022  work    07:48
12  Jones, Harry  14-03-2022  drive   08:01
13  Jones, Harry  14-03-2022  work    08:03
14  Jones, Harry  14-03-2022  drive   08:04
15  Jones, Harry  14-03-2022  work    08:07
16  Jones, Harry  14-03-2022  rest    08:08
I understand you would want the First work and Last rest to be aggregated by name (as you have indicated) and also by date (so that for each name you would get one row with First work and Last rest for each date, if there are multiple dates associated with a given name).
With a data structure as in your example (A: Name, B: Date, C: Action, D: Time), you can use the following formula (you do not need a list of names):
={QUERY(A:D,"select A, B, min(D) where C='work' group by A, B label min(D) 'First work'",1),
QUERY(A:D,"select max(D) where C='rest' group by A, B label max(D) 'Last rest'",1)}
This assumes that there is at least one 'work' and one 'rest' entry for each name on each date. Otherwise the two results will be misaligned, or the formula will show an error if the number of 'work' and 'rest' rows differs.
If that can happen in your actual data, it is safer to display the results as two separate tables. For First work:
=QUERY(A:D,"select A, B, min(D) where C='work' group by A, B label min(D) 'First work'",1)
and for Last rest:
=QUERY(A:D,"select A,B, max(D) where C='rest' group by A, B label max(D) 'Last rest'",1)
I hope it is clear what the queries are doing; if not, consult the documentation for the QUERY function and the Query Language, or ask if something is still unclear.
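To make the aggregation explicit, here is a small Python sketch of what the two queries compute together: the earliest 'work' time and the latest 'rest' time per (name, date). The function name `first_work_last_rest` is hypothetical, and the sketch assumes times are zero-padded HH:MM strings so that string comparison matches time order.

```python
def first_work_last_rest(rows):
    """rows: (name, date, action, time). Returns {(name, date): (first_work, last_rest)}."""
    summary = {}
    for name, date, action, time in rows:
        first_work, last_rest = summary.get((name, date), (None, None))
        if action == "work" and (first_work is None or time < first_work):
            first_work = time  # earliest 'work' seen so far
        elif action == "rest" and (last_rest is None or time > last_rest):
            last_rest = time   # latest 'rest' seen so far
        summary[(name, date)] = (first_work, last_rest)
    return summary

rows = [
    ("Smith, Fred", "14-03-2022", "rest", "00:00"),
    ("Smith, Fred", "14-03-2022", "work", "06:03"),
    ("Smith, Fred", "14-03-2022", "rest", "06:05"),
    ("Smith, Fred", "14-03-2022", "work", "06:06"),
    ("Smith, Fred", "14-03-2022", "drive", "06:15"),
    ("Smith, Fred", "14-03-2022", "rest", "06:59"),
    ("Smith, Fred", "14-03-2022", "drive", "07:02"),
    ("Smith, Fred", "14-03-2022", "rest", "07:08"),
    ("Jones, Harry", "14-03-2022", "rest", "00:00"),
    ("Jones, Harry", "14-03-2022", "work", "07:48"),
    ("Jones, Harry", "14-03-2022", "drive", "08:01"),
    ("Jones, Harry", "14-03-2022", "work", "08:03"),
    ("Jones, Harry", "14-03-2022", "drive", "08:04"),
    ("Jones, Harry", "14-03-2022", "work", "08:07"),
    ("Jones, Harry", "14-03-2022", "rest", "08:08"),
]
summary = first_work_last_rest(rows)
print(summary)
```

Because both values are stored under the same (name, date) key, a person with no 'work' or no 'rest' entry simply gets None for that value, so the two columns cannot drift out of alignment.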

Power Pivot - Aggregate within groups to determine max value

I'm looking for a DAX formula (for Power Pivot) that aggregates within certain groups and across other groups to determine the maximum.
Here's my data table:
State  Customer  Fruit   Qty
NY     A         Apple   5
NY     A         Orange  1
NY     A         Pear    5
NY     B         Apple   1
NY     B         Orange  6
NY     C         Apple   2
NY     C         Orange  2
NY     C         Pear    5
CA     D         Orange  4
CA     D         Pear    2
I want to determine the most popular fruit by State (ignoring Customer). In NY, there are a total of 8 apples, 9 oranges, and 10 pears. So the formula should return Pear.
Resulting in a table like this:
State  Dominant Fruit
NY     Pear
CA     Orange
What is the Power Pivot formula I need for that Dominant Fruit column on the resulting table? Thanks
You can create a measure to rank the amount of fruits per state like so:
Ranking = RANKX( ALLEXCEPT( 'Table','Table'[Customer],'Table'[State] ) , CALCULATE( SUM( 'Table'[Qty] ) ) )
This measure gives the dominant fruit (the one with the highest total quantity) a rank of 1.
You can then add a filter on the visual to show only the rows where the rank is 1.
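As a cross-check of the expected result, the aggregation the question describes (sum Qty per state and fruit, ignoring customer, then pick the fruit with the largest total) can be sketched in Python. This is not DAX and not the measure above; the function name `dominant_fruit` is hypothetical.

```python
from collections import defaultdict

def dominant_fruit(rows):
    """rows: (state, customer, fruit, qty). Returns {state: fruit with the largest total qty}."""
    totals = defaultdict(lambda: defaultdict(int))
    for state, _customer, fruit, qty in rows:
        totals[state][fruit] += qty  # the customer is deliberately ignored
    # On a tie, max() keeps the fruit that was encountered first.
    return {state: max(fruits, key=fruits.get) for state, fruits in totals.items()}

rows = [
    ("NY", "A", "Apple", 5), ("NY", "A", "Orange", 1), ("NY", "A", "Pear", 5),
    ("NY", "B", "Apple", 1), ("NY", "B", "Orange", 6),
    ("NY", "C", "Apple", 2), ("NY", "C", "Orange", 2), ("NY", "C", "Pear", 5),
    ("CA", "D", "Orange", 4), ("CA", "D", "Pear", 2),
]
result = dominant_fruit(rows)
print(result)  # {'NY': 'Pear', 'CA': 'Orange'}
```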

How to find average of tuples in relational algebra calculator

The problem is to use the group by function to find the average number of books checked out by students of one specific department only. However, my expression keeps returning the average over all checked-out books from all students.
What I have so far:
γ avg(Books_Quantity) -> y (Student) ⨝ (σ Department = 'Computer_Science' (Student))
The output should be 1.75, but is instead outputting the average for all the departments.
y    Student.Student_ID  Student.Student_Name  Student.Department  Student.Books_Quantity
1.5  1                   John                  Computer_Science    2
1.5  2                   Lisa                  Computer_Science    1
1.5  5                   Xina                  Computer_Science    3
1.5  7                   Chang                 Computer_Science    1
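I found the answer. The selection has to be applied to the relation inside the grouping operation, so that the average is computed over the filtered tuples only, rather than joined on afterwards. Like so: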
γ avg(Books_Quantity) -> y (σ Department = 'Computer_Science' (Student))
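The order of operations can be illustrated with a short Python sketch: filter first, then average. The four Computer_Science rows come from the output above; the two rows for other departments are hypothetical, added only so the filter has something to exclude, and the function name `average_books` is an assumption as well.

```python
def average_books(students, department):
    """Average Books_Quantity over the students of one department (selection before aggregation)."""
    quantities = [qty for _id, _name, dept, qty in students if dept == department]
    return sum(quantities) / len(quantities)

students = [
    (1, "John", "Computer_Science", 2),
    (2, "Lisa", "Computer_Science", 1),
    (5, "Xina", "Computer_Science", 3),
    (7, "Chang", "Computer_Science", 1),
    # Hypothetical rows for other departments (not from the original data):
    (3, "Ann", "Mathematics", 1),
    (4, "Raj", "Physics", 1),
]
cs_average = average_books(students, "Computer_Science")
print(cs_average)  # 1.75
```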

Reporting Services: how to apply interactive sort for matrix columns?

I need to apply interactive sorting to matrix columns that contain aggregated data.
The report is counting products sold in different places:
             Product A   Product B   Product C
---------------------------------------------------------------
Country 1    5           10          4
  City A     3           0           3
  City B     2           10          1
---------------------------------------------------------------
Country 2    10          5           5
  City C     2           4           2
  City D     8           1           3
After descending sorting on "Product A" the table rows should be sorted by "Product A" sales in country, and also by sales in city:
             Product A   Product B   Product C
---------------------------------------------------------------
Country 2    10          5           5
  City D     8           1           3
  City C     2           4           2
---------------------------------------------------------------
Country 1    5           10          4
  City A     3           0           3
  City B     2           10          1
The matrix scheme looks like this:
          |        | [Product]
[Country] | [City] | [Count(Product)]
Interactive sort is not supported in a matrix.
A workaround could be the following.
Create a SortBy parameter with these values:
Label                      Value
Country ASC, City ASC      1
Country DESC, City ASC     2
Country ASC, City DESC     3
Country DESC, City DESC    4
Then, in the country group, create two sorting expressions:
=Iif(Parameters!SortBy.Value = 1 OR Parameters!SortBy.Value = 3,Fields!country.Value,"")
ASCENDING sort
=Iif(Parameters!SortBy.Value = 2 OR Parameters!SortBy.Value = 4,Fields!country.Value,"")
DESCENDING sort
Do the same for city:
=Iif(Parameters!SortBy.Value = 1 OR Parameters!SortBy.Value = 2,Fields!city.Value,"")
ASCENDING sort
=Iif(Parameters!SortBy.Value = 3 OR Parameters!SortBy.Value = 4,Fields!city.Value,"")
DESCENDING sort
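The effect of the parameter can be sketched in Python to show how each value maps to a pair of sort directions. This is only an illustration of the logic, not Reporting Services code; the names `SORT_OPTIONS` and `sort_rows` are hypothetical.

```python
# Parameter value -> (country descending?, city descending?), following the table above.
SORT_OPTIONS = {
    1: (False, False),  # Country ASC,  City ASC
    2: (True, False),   # Country DESC, City ASC
    3: (False, True),   # Country ASC,  City DESC
    4: (True, True),    # Country DESC, City DESC
}

def sort_rows(rows, sort_by):
    """rows: (country, city) tuples; sort_by: the SortBy parameter value (1 to 4)."""
    country_desc, city_desc = SORT_OPTIONS[sort_by]
    # Python's sort is stable, so sort by the secondary key (city) first
    # and by the primary key (country) second.
    by_city = sorted(rows, key=lambda row: row[1], reverse=city_desc)
    return sorted(by_city, key=lambda row: row[0], reverse=country_desc)

rows = [
    ("Country 1", "City A"),
    ("Country 1", "City B"),
    ("Country 2", "City C"),
    ("Country 2", "City D"),
]
print(sort_rows(rows, 4))
```

Like the expressions above, this sorts by country and city names; sorting by an aggregated value such as the Product A count would need that value as the sort key instead.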
