aggregate ordered rows in hive table - hadoop

I have a table in Hive with 4 columns like this:
row_id | user_id | product_id | duration
1      | 1       | product1   | 3
2      | 1       | product1   | 1
3      | 1       | product2   | 6
4      | 1       | product3   | 2
5      | 1       | product1   | 4
6      | 1       | product4   | 3
7      | 1       | product4   | 5
8      | 1       | product4   | 7
9      | 2       | product4   | 3
10     | 2       | product4   | 6
I want to aggregate rows of the same product for each user, summing the duration and counting the clicks, but only when the rows are consecutive in row_id order:
row_id | user_id | product_id | duration_per_product | clicks_per_product
1      | 1       | product1   | 4                    | 2
2      | 1       | product2   | 6                    | 1
3      | 1       | product3   | 2                    | 1
4      | 1       | product1   | 4                    | 1
5      | 1       | product4   | 15                   | 3
6      | 2       | product4   | 9                    | 2
Any ideas how to do that in Hive 1.1.0?
GROUP BY alone obviously doesn't work, because I only want to group rows of a product when they are consecutive. I have tried CASE, LAG and LEAD, but couldn't get them to work.
Thanks!

First off, this is something you would normally do in a loop; Hive is not well suited to this kind of problem.
That being said, here is an approach that should work.
Suppose this is our dataset (row_id, user_id, product_id, duration):
1 1 product1 3
2 1 product1 1
3 1 product2 6
4 1 product1 4
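Identify the starter rows, i.e. the rows that begin a new run: 1, 3, 4.
This can be done with a left join of the table to itself on id = id + 1 and checking whether user and product match; a row whose previous row is missing or differs in user or product is a starter.
Join every row onto these starters by user and product (starter id, row id):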
1 1
1 2
1 4
3 3
4 1
4 2
4 4
Filter out the pairs that are in the wrong order (starter after row). The remaining pairs are:
1 1
1 2
1 4
3 3
4 4
Group by row to find the maximum valid starter for each row. The remaining pairs are:
1 1
1 2
3 3
4 4
Now join back to the original table to reattach the relevant dimensions (starter id, row id, duration):
1 1 3
1 2 1
3 3 6
4 4 4
Now you can get the results by grouping on the starter id and summing the duration (counting the rows in each group gives the clicks):
1 4
3 6
4 4
Of course, you can now use another join to attach the name of the product:
1 product1 4
3 product2 6
4 product1 4
And that is all!
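
As a cross-check of the steps above, here is a minimal sketch in plain Python rather than HiveQL. It runs the starter logic on the full dataset from the question. The function name aggregate_runs and the tuple layout are illustrative assumptions, and the sketch assumes row_id values are gapless so that "previous row" means row_id - 1, just like the id = id + 1 join.

```python
from collections import defaultdict

# (row_id, user_id, product_id, duration), copied from the question
rows = [
    (1, 1, "product1", 3),
    (2, 1, "product1", 1),
    (3, 1, "product2", 6),
    (4, 1, "product3", 2),
    (5, 1, "product1", 4),
    (6, 1, "product4", 3),
    (7, 1, "product4", 5),
    (8, 1, "product4", 7),
    (9, 2, "product4", 3),
    (10, 2, "product4", 6),
]


def aggregate_runs(rows):
    """Hypothetical helper: sum duration and count clicks per consecutive run."""
    by_id = {r[0]: r for r in rows}

    # Starter rows: the previous row (row_id - 1) is missing
    # or has a different user or product.
    starters = [
        r for r in rows
        if by_id.get(r[0] - 1) is None or by_id[r[0] - 1][1:3] != r[1:3]
    ]

    # For each row, keep the latest starter with the same user and product
    # that does not come after the row, then accumulate per starter.
    totals = defaultdict(lambda: [0, 0])  # starter_id -> [duration, clicks]
    for row_id, user, product, duration in rows:
        starter_id = max(
            s[0] for s in starters
            if s[1] == user and s[2] == product and s[0] <= row_id
        )
        totals[starter_id][0] += duration
        totals[starter_id][1] += 1

    # Reattach user and product from the starter row.
    return [
        (by_id[sid][1], by_id[sid][2], duration, clicks)
        for sid, (duration, clicks) in sorted(totals.items())
    ]


result = aggregate_runs(rows)
for line in result:
    print(line)
```

The printed tuples are (user_id, product_id, duration_per_product, clicks_per_product), one per run, in starter order.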

Related

Loop for sum of column-values based on two conditions with R

I have a data frame looking like this, with many more Persons, timepoints and values.
Person Timepoint Value
1 P1 1 2
2 P1 1 3
3 P1 2 1
4 P1 2 4
5 P1 2 2
6 P2 1 3
7 P2 1 5
8 P2 2 2
9 P2 3 1
10 P2 3 2
11 P2 3 3
12 … … …
I would now like to either
create a loop that calculates the mean of the values for each person at each timepoint
and writes the results directly into new columns, e.g. df$Mean for the different timepoints (MeanT1, MeanT2, ...),
or create a new data frame with these means so that I can merge it with the original data frame.
Example:
Person Timepoint Value Mean_T1 Mean_T2 X.
1 P1 1 2 2.5 2.3 …
2 P1 1 3 2.5 2.3 …
3 P1 2 1 2.5 2.3 …
4 P1 2 4 2.5 2.3 …
5 P1 2 2 2.5 2.3 …
6 P2 1 3 4 2 …
7 P2 1 5 4 2 …
8 P2 2 2 4 2 …
9 P2 3 1 4 2 …
10 P2 3 2 4 2 …
11 P2 3 3 4 2 …
12 … … … … … …
I tried several options, but none of them works.
Does anybody have an idea how to proceed? Any advice is welcome!
Thank you very much in advance!
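
The calculation being asked for is a mean per (Person, Timepoint) group. The sketch below shows that grouping in plain Python rather than R, using the eleven rows from the question; the name group_means and the tuple layout are assumptions made for illustration only.

```python
from collections import defaultdict

# (Person, Timepoint, Value), copied from the question
data = [
    ("P1", 1, 2), ("P1", 1, 3),
    ("P1", 2, 1), ("P1", 2, 4), ("P1", 2, 2),
    ("P2", 1, 3), ("P2", 1, 5),
    ("P2", 2, 2),
    ("P2", 3, 1), ("P2", 3, 2), ("P2", 3, 3),
]


def group_means(records):
    """Hypothetical helper: mean Value for each (Person, Timepoint) pair."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for person, timepoint, value in records:
        acc = sums[(person, timepoint)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


means = group_means(data)
for (person, timepoint), mean in sorted(means.items()):
    print(person, timepoint, round(mean, 2))
```

The resulting dictionary plays the role of the "new data frame" option: it can be looked up by person and timepoint to fill one mean column per timepoint.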

How to get several rows from each group with the Laravel query builder

a16s table
id p_id u_id time
1 1 2 0
2 1 1 1
3 1 5 2
4 1 6 3
5 1 7 4
6 2 2 2
7 2 3 1
8 2 1 0
9 3 2 11
10 3 4 8
11 3 8 15
I want to get the first two rows of each group, ordered by time:
p_id u_id time
1 2 0
1 1 1
2 1 0
2 3 1
3 4 8
3 2 11
I tried this query:
$result = DB::table('a16s')
    ->select('p_id', 'u_id', 'time')
    ->orderBy('time', 'desc')
    ->groupBy('p_id')
    ->get();
echo '<pre>';
print_r($result);
I got this error:
SQLSTATE[42000]: Syntax error or access violation: 1055 Expression #2 of SELECT list is not in GROUP BY clause and contains nonaggregated column...
Can I use groupBy twice? I want to use this result in a jQuery DataTable.
From the database
id p_id u_id approve time
1 1 1 1 1
2 1 2 1 2
3 1 3 1 3
4 1 4 0 4
5 1 5 0 5
6 2 1 0 1
7 2 2 1 2
8 2 5 0 3
9 2 6 0 4
10 3 2 1 1
11 3 5 1 2
12 3 8 1 3
to get the table
Try this:
$result = DB::table('a16s')
    ->select('p_id', 'u_id', 'time')
    // ascending, so that take(2) keeps the two earliest rows of each group
    ->orderBy('time', 'asc')
    ->get()
    // group the fetched collection in PHP instead of in SQL
    ->groupBy('p_id')
    ->map(function ($deal) {
        return $deal->take(2);
    });
With your SQL version, u_id will either need to be left out of the SELECT list or added to the GROUP BY clause.
See the MySQL documentation for more info.
Using this trick:
$join = DB::table('a16s')->select('p_id')
    ->selectRaw('GROUP_CONCAT(time ORDER BY time ASC) times')
    ->groupBy('p_id');
$sql = '(' . $join->toSql() . ') latest';
$result = DB::table('a16s')
    ->select('a16s.*')
    ->join(DB::raw($sql), function ($join) {
        $join->on('a16s.p_id', '=', 'latest.p_id');
        $join->whereBetween(DB::raw('FIND_IN_SET(`a16s`.`time`, `latest`.`times`)'), [1, 2]);
    })
    ->get();
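
Both answers boil down to "sort by time within each p_id and keep the first two rows". A language-neutral sketch of that logic in plain Python (not Laravel) on the a16s data from the question might look like this; first_n_per_group is a hypothetical helper name, not part of any framework.

```python
from itertools import groupby, islice

# (p_id, u_id, time), copied from the a16s table in the question
a16s = [
    (1, 2, 0), (1, 1, 1), (1, 5, 2), (1, 6, 3), (1, 7, 4),
    (2, 2, 2), (2, 3, 1), (2, 1, 0),
    (3, 2, 11), (3, 4, 8), (3, 8, 15),
]


def first_n_per_group(rows, n=2):
    """Hypothetical helper: the n earliest rows (by time) for each p_id."""
    # Sort by group first, then by time, so each group is contiguous and ordered.
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    result = []
    for _, group in groupby(ordered, key=lambda r: r[0]):
        result.extend(islice(group, n))
    return result


top_two = first_n_per_group(a16s)
for row in top_two:
    print(row)
```

On this data the output matches the desired table in the question.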

ORACLE: calculate sales, returns and the rest for a customer in the same table for the same product

Oracle SELECT:
calculate sales, returns and the rest for a customer, in the same table and for the same product, according to the transaction type.
I need to calculate the total sales, the total returns and the rest (sales minus returns) for each customer and item,
grouped by customer.
Trans_Type:
1= Sales
2= Return
ID Trans_Type DATE Items_ID Quantity Clint_ID
--- ---------- -------- ---------- ---------- ----------
1 1 16-OCT-09 701555 3 1
2 2 12-DEC-09 701555 1 1
3 1 30-JUL-10 701511 63 2
4 2 30-JUL-10 701555 1 1
5 1 30-JUL-10 701234 2 3
6 1 30-JUL-10 701234 5 3
7 2 30-JUL-10 701511 1 2
8 1 30-JUL-10 701522 3 2
9 1 30-JUL-10 701555 2 3
10 1 30-JUL-10 701555 4 2
11 2 30-JUL-10 701555 2 2
If I understood everything correctly, you need to use CASE WHEN ... and GROUP BY ... clauses, like here:
select clint_id, items_id, qty, ret, nvl(qty,0) - nvl(ret,0) rest
  from (
    select clint_id, items_id,
           sum(case when trans_type = 1 then quantity end) qty,
           sum(case when trans_type = 2 then quantity end) ret
      from data
     group by clint_id, items_id )
 order by clint_id, items_id
SQLFiddle demo
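
To see what the conditional aggregation produces on the sample rows, here is a small sketch in plain Python rather than Oracle SQL. The function name summarize is an assumption, the DATE column is left out because it does not affect the totals, and missing sales or returns are reported as 0 here, whereas the SQL above would show NULL in the qty or ret column.

```python
from collections import defaultdict

# (trans_type, items_id, quantity, clint_id), copied from the question;
# trans_type 1 = sale, 2 = return
transactions = [
    (1, 701555, 3, 1),
    (2, 701555, 1, 1),
    (1, 701511, 63, 2),
    (2, 701555, 1, 1),
    (1, 701234, 2, 3),
    (1, 701234, 5, 3),
    (2, 701511, 1, 2),
    (1, 701522, 3, 2),
    (1, 701555, 2, 3),
    (1, 701555, 4, 2),
    (2, 701555, 2, 2),
]


def summarize(transactions):
    """Hypothetical helper: (client, item, sold, returned, rest) per client and item."""
    totals = defaultdict(lambda: {"qty": 0, "ret": 0})
    for trans_type, item, quantity, client in transactions:
        if trans_type == 1:
            totals[(client, item)]["qty"] += quantity
        elif trans_type == 2:
            totals[(client, item)]["ret"] += quantity
    return [
        (client, item, t["qty"], t["ret"], t["qty"] - t["ret"])
        for (client, item), t in sorted(totals.items())
    ]


summary = summarize(transactions)
for row in summary:
    print(row)
```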

Sort on specific columns, output only one of those identical but having the highest number in another column

I have records like these:
1 4 6 4 2 4 8
2 3 5 4 6 7 1
5 4 6 4 3 8 4
1 4 6 4 5 7 1
5 7 3 3 3 6 3
6 7 3 3 4 8 4
I want to sort them on columns 2, 3, 4 and 6 and, among the records that are identical in columns 2, 3 and 4, keep only the one with the biggest number in column 6, such as:
1 4 6 4 5 7 1
2 3 5 4 6 7 1
5 4 6 4 3 8 4
5 7 3 3 3 6 3
6 7 3 3 4 8 4
I have tried all kinds of combinations of sort and uniq, but everything fails because uniq cannot be applied to a specific column. The only thing I came up with is to change the order of the columns: first sort as above, then move columns 2, 3 and 4 to the end, and then run uniq with -w so that it focuses only on those last 3 columns. This seems quite inefficient to me.
Thanks for help!
You can achieve this with two passes of sort (assuming I understand your requirement correctly, given that the desired output posted above does not match your description of it). The first pass sorts by fields 2 through 4 ascending and by field 6 descending. The second pass sorts on fields 2 through 4 only, with the "stable sort" and "unique" flags added, so that for each combination of fields 2-4 it keeps the first row, which is the one with the highest value in field 6:
sort -k2,2n -k3,3n -k4,4n -k6,6nr file.txt | sort -k2,2n -k3,3n -k4,4n -s -u
2 3 5 4 6 7 1
5 4 6 4 3 8 4
6 7 3 3 4 8 4
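
The same "keep the row with the highest field 6 per combination of fields 2-4" rule can be sketched without sort at all. The following plain-Python version is only an illustration of the rule on the six sample records; keep_max_per_key is a hypothetical name, and ties in field 6 are resolved in favour of the first record seen, which is an assumption the question does not address.

```python
# Records copied from the question, one list of integer fields per line
records = [
    [1, 4, 6, 4, 2, 4, 8],
    [2, 3, 5, 4, 6, 7, 1],
    [5, 4, 6, 4, 3, 8, 4],
    [1, 4, 6, 4, 5, 7, 1],
    [5, 7, 3, 3, 3, 6, 3],
    [6, 7, 3, 3, 4, 8, 4],
]


def keep_max_per_key(records):
    """Hypothetical helper: per (field 2, 3, 4) key, keep the row with the largest field 6."""
    best = {}
    for rec in records:
        key = tuple(rec[1:4])  # fields 2-4 (zero-based indices 1..3)
        if key not in best or rec[5] > best[key][5]:  # field 6 is index 5
            best[key] = rec
    # Emit in ascending key order, like the sorted output.
    return [best[key] for key in sorted(best)]


kept = keep_max_per_key(records)
for rec in kept:
    print(*rec)
```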

How to Use Where with Many to Many Comparison

I have a problem with a LINQ query in the following scenario:
I have two tables, Activity and ActivityTeacher, and a list of some teachers.
Activity Table
ActivityID Date Class
1 4/4/2012 1
2 4/5/2013 2
3 4/6/2013 5
4 5/6/2013 2
5 5/16/2013 1
6 5/20/2013 8
7 5/21/2013 7
8 6/22/2013 6
9 8/10/2013 5
10 8/12/2013 4
ActivityTeacher Table
ActivityID TeacherID
1 2
1 3
1 4
2 6
3 6
3 6
3 4
2 5
4 2
4 3
4 6
5 8
5 7
5 6
6 6
6 7
6 9
6 10
6 1
6 2
7 2
7 8
7 9
7 10
8 3
8 4
8 6
8 7
9 10
9 3
9 2
10 1
10 2
List of Teachers={2,3,4}
Now I want to select the records from Activity that match the List of Teachers={2,3,4},
without using a foreach loop.
The Activity entity should have a Teachers navigation property you can utilize:
context.Activities
    // keep an activity if any of its teachers is in the list
    .Where(x => x.Teachers.Any(t => listOfTeachers.Contains(t.TeacherId)));
If listOfTeachers contains the three IDs 2, 3, 4, this query should translate to SQL that is similar to the following:
select a.*
from Activity a
inner join ActivityTeacher at
on a.activityid = at.activityid
where at.teacherid in (2, 3, 4);
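
For reference, the filter "activities with at least one teacher from the list" can be checked against the sample data with a short plain-Python sketch (not LINQ). It interprets the question as "any of the listed teachers", as the SQL above does; the variable names are illustrative assumptions.

```python
# (ActivityID, TeacherID) pairs, copied from the ActivityTeacher table
activity_teacher = [
    (1, 2), (1, 3), (1, 4), (2, 6), (3, 6), (3, 6), (3, 4), (2, 5),
    (4, 2), (4, 3), (4, 6), (5, 8), (5, 7), (5, 6),
    (6, 6), (6, 7), (6, 9), (6, 10), (6, 1), (6, 2),
    (7, 2), (7, 8), (7, 9), (7, 10),
    (8, 3), (8, 4), (8, 6), (8, 7),
    (9, 10), (9, 3), (9, 2), (10, 1), (10, 2),
]
list_of_teachers = {2, 3, 4}

# A set comprehension removes the duplicates that a plain inner join would
# return for activities taught by more than one listed teacher.
matching_activity_ids = sorted(
    {activity for activity, teacher in activity_teacher
     if teacher in list_of_teachers}
)
print(matching_activity_ids)
```

Activities 2 and 5 are the only ones without a teacher from the list, so they are the only ones left out.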
