I have a table in Hive with 4 columns like this:
row_id  user_id  product_id  duration
1       1        product1    3
2       1        product1    1
3       1        product2    6
4       1        product3    2
5       1        product1    4
6       1        product4    3
7       1        product4    5
8       1        product4    7
9       2        product4    3
10      2        product4    6
I want to aggregate rows of the same product for each user, summing the duration and counting the clicks, but only when the rows are consecutive:
row_id  user_id  product_id  duration_per_product  clicks_per_product
1       1        product1    4                     2
2       1        product2    6                     1
3       1        product3    2                     1
4       1        product1    4                     1
5       1        product4    15                    3
6       2        product4    9                     2
Any ideas how to do that in Hive 1.1.0?
A plain GROUP BY obviously doesn't work, because I only want to group rows of the same product when they are consecutive. I have tried CASE, LAG and LEAD, but couldn't get them to work.
Thanks!
First off, this is something you would normally do in a loop; Hive is not well suited to this kind of problem.
That being said, here is an approach that should work.
Suppose this is our dataset (row_id, user_id, product_id, duration):
1 1 product1 3
2 1 product1 1
3 1 product2 6
4 1 product1 4
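Identify the starter rows, i.e. the first row of each run: 1, 3, 4
This can be done with a left join of each row to the previous one (on id = id + 1) and checking whether user and product match; a row without a matching predecessor is a starter.
Join every row onto these starters by user and product (starter id, row id):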
1 1
1 2
1 4
3 3
4 1
4 2
4 4
Filter out the pairs that are in the wrong order (starter after row); the remaining pairs are:
1 1
1 2
1 4
3 3
4 4
Group by row to find the maximum valid starter for each row; what remains is:
1 1
1 2
3 3
4 4
Now join back to the original table to reattach the relevant columns (here, the duration):
1 1 3
1 2 1
3 3 6
4 4 4
Now you can get the results by grouping on the starter id and summing the duration:
1 4
3 6
4 4
Of course, you can now use another join to attach the name of the product:
1 product1 4
3 product2 6
4 product1 4
And that is all!
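The steps above can be checked outside Hive with a small sketch. This is a plain Python illustration of the same logic, not a Hive query; the function name aggregate_runs is made up for the example, and it assumes (like the id = id + 1 join) that row ids are consecutive integers. It also counts the rows per starter, which gives the clicks.

```python
# (row_id, user_id, product_id, duration), taken from the question.
rows = [
    (1, 1, "product1", 3), (2, 1, "product1", 1), (3, 1, "product2", 6),
    (4, 1, "product3", 2), (5, 1, "product1", 4), (6, 1, "product4", 3),
    (7, 1, "product4", 5), (8, 1, "product4", 7), (9, 2, "product4", 3),
    (10, 2, "product4", 6),
]


def aggregate_runs(rows):
    """Hypothetical helper mirroring the starter-row steps described above."""
    by_id = {row_id: (user, product) for row_id, user, product, _ in rows}

    # Starter rows: the previous row (id - 1) is missing or differs in user/product.
    starters = [
        (row_id, user, product)
        for row_id, user, product, _ in rows
        if by_id.get(row_id - 1) != (user, product)
    ]

    totals = {}
    for row_id, user, product, duration in rows:
        # Join on user and product, drop starters that come after the row,
        # then keep the maximum remaining starter id.
        starter_id = max(
            s_id
            for s_id, s_user, s_product in starters
            if (s_user, s_product) == (user, product) and s_id <= row_id
        )
        total, clicks = totals.get(starter_id, (0, 0))
        totals[starter_id] = (total + duration, clicks + 1)

    # Group on the starter id and reattach user and product.
    return [(s_id, user, product, *totals[s_id]) for s_id, user, product in starters]


result = aggregate_runs(rows)
for line in result:
    print(line)
```

The first column of the result is the starter row id rather than a fresh 1..n numbering; renumbering would be one more step.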
Related
I have a data frame looking like this, with many more Persons, timepoints and values.
Person Timepoint Value
1 P1 1 2
2 P1 1 3
3 P1 2 1
4 P1 2 4
5 P1 2 2
6 P2 1 3
7 P2 1 5
8 P2 2 2
9 P2 3 1
10 P2 3 2
11 P2 3 3
12 … … …
I now would like to
create a loop for calculating the mean of the values for each person at each timepoint
and write the results directly in a new column e.g. df$Mean for the different timepoints (MeanT1, MeanT2...)
or create a new data frame with the values so I can merge them with the original data frame.
Example:
Person Timepoint Value Mean_T1 Mean_T2 X.
1 P1 1 2 2.5 2.3 …
2 P1 1 3 2.5 2.3 …
3 P1 2 1 2.5 2.3 …
4 P1 2 4 2.5 2.3 …
5 P1 2 2 2.5 2.3 …
6 P2 1 3 4 2 …
7 P2 1 5 4 2 …
8 P2 2 2 4 2 …
9 P2 3 1 4 2 …
10 P2 3 2 4 2 …
11 P2 3 3 4 2 …
12 … … … … … …
I tried several options, but none of them worked.
Does anybody have an idea how to proceed? Any advice is welcome!
Thank you very much in advance!
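For what it's worth, the computation being asked for is a grouped mean followed by a reshape to one column per timepoint. The sketch below shows that idea in plain Python rather than R, using only the sample rows above; the names group_means and wide are made up for the illustration.

```python
from collections import defaultdict
from statistics import mean

# (Person, Timepoint, Value), the sample rows from the question.
data = [
    ("P1", 1, 2), ("P1", 1, 3), ("P1", 2, 1), ("P1", 2, 4), ("P1", 2, 2),
    ("P2", 1, 3), ("P2", 1, 5), ("P2", 2, 2), ("P2", 3, 1), ("P2", 3, 2),
    ("P2", 3, 3),
]


def group_means(records):
    """Mean of Value for every (Person, Timepoint) pair."""
    groups = defaultdict(list)
    for person, timepoint, value in records:
        groups[(person, timepoint)].append(value)
    return {key: mean(values) for key, values in groups.items()}


means = group_means(data)
timepoints = sorted({timepoint for _, timepoint, _ in data})

# Attach one Mean_T<n> column per timepoint to every original row.
# A person without data at a timepoint gets None in that column.
wide = [
    {
        "Person": person,
        "Timepoint": timepoint,
        "Value": value,
        **{f"Mean_T{tp}": means.get((person, tp)) for tp in timepoints},
    }
    for person, timepoint, value in data
]

for row in wide:
    print(row)
```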
a16s table
id p_id u_id time
1 1 2 0
2 1 1 1
3 1 5 2
4 1 6 3
5 1 7 4
6 2 2 2
7 2 3 1
8 2 1 0
9 3 2 11
10 3 4 8
11 3 8 15
I want to get the first two rows, ordered by time, from each group:
p_id u_id time
1 2 0
1 1 1
2 1 0
2 3 1
3 4 8
3 2 11
I tried this query:
$result = DB::table('a16s')
->select('p_id', 'u_id', 'time')
->orderBy('time', 'desc')
->groupBy('p_id')
->get();
echo '<pre>' ;
print_r($result);
I got the error
SQLSTATE[42000]: Syntax error or access violation: 1055 Expression #2 of SELECT list is not in GROUP BY clause and contains nonaggregated column...
Can I use groupBy twice? I want to use this result in a jQuery DataTable.
from the database
id p_id u_id approve time
1 1 1 1 1
2 1 2 1 2
3 1 3 1 3
4 1 4 0 4
5 1 5 0 5
6 2 1 0 1
7 2 2 1 2
8 2 5 0 3
9 2 6 0 4
10 3 2 1 1
11 3 5 1 2
12 3 8 1 3
to get the table
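Try this:
$result = DB::table('a16s')
    ->select('p_id', 'u_id', 'time')
    // Ascending, so that take(2) keeps the two earliest rows of each group.
    ->orderBy('time', 'asc')
    ->get()
    // From here on this is a Collection: the grouping happens in PHP, not in SQL.
    ->groupBy('p_id')
    ->map(function ($deal) {
        return $deal->take(2);
    });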
With your MySQL configuration, u_id either needs to be left out of the SELECT or added to the GROUP BY clause (error 1055 is raised when the ONLY_FULL_GROUP_BY SQL mode is enabled).
See the MySQL documentation for more info.
Using this trick:
$join = DB::table('a16s')->select('p_id')
    ->selectRaw('GROUP_CONCAT(time ORDER BY time ASC) times')->groupBy('p_id');
$sql = '(' . $join->toSql() . ') latest';
$result = DB::table('a16s')
    ->select('a16s.*')
    ->join(DB::raw($sql), function($join) {
        $join->on('a16s.p_id','=','latest.p_id');
        $join->whereBetween(DB::raw('FIND_IN_SET(`a16s`.`time`, `latest`.`times`)'), [1, 2]);
    })
    ->get();
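To make the intended result concrete, here is the same "first N rows per group" logic as a small standalone sketch. It is written in Python purely as an illustration of what the queries above should return; first_n_per_group is a made-up name, and the data is the a16s table from the question without the id column.

```python
from itertools import groupby, islice
from operator import itemgetter

# (p_id, u_id, time), from the a16s table in the question.
a16s = [
    (1, 2, 0), (1, 1, 1), (1, 5, 2), (1, 6, 3), (1, 7, 4),
    (2, 2, 2), (2, 3, 1), (2, 1, 0),
    (3, 2, 11), (3, 4, 8), (3, 8, 15),
]


def first_n_per_group(rows, n=2):
    """Keep the n rows with the smallest time for every p_id."""
    # Sort by p_id, then by time, so each group is contiguous and ordered.
    ordered = sorted(rows, key=itemgetter(0, 2))
    result = []
    for _, group in groupby(ordered, key=itemgetter(0)):
        result.extend(islice(group, n))
    return result


top_two = first_n_per_group(a16s)
for row in top_two:
    print(row)
```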
Oracle SELECT
Calculate sales, returns and the rest for a customer in the same table, for the same product, according to the transaction type
I need to calculate the total sales, the total returns and the rest (sales minus returns) for each customer and item,
grouped by customer.
Trans_Type:
1= Sales
2= Return
ID Trans_Type DATE Items_ID Quantity Clint_ID
--- ---------- -------- ---------- ---------- ----------
1 1 16-OCT-09 701555 3 1
2 2 12-DEC-09 701555 1 1
3 1 30-JUL-10 701511 63 2
4 2 30-JUL-10 701555 1 1
5 1 30-JUL-10 701234 2 3
6 1 30-JUL-10 701234 5 3
7 2 30-JUL-10 701511 1 2
8 1 30-JUL-10 701522 3 2
9 1 30-JUL-10 701555 2 3
10 1 30-JUL-10 701555 4 2
11 2 30-JUL-10 701555 2 2
If I understood everything correctly, you need to use CASE WHEN ... and GROUP BY ... clauses, like here:
select clint_id, items_id, qty, ret, nvl(qty,0) - nvl(ret,0) rest
from (
select clint_id, items_id,
sum(case when trans_type = 1 then quantity end) qty,
sum(case when trans_type = 2 then quantity end) ret
from data group by clint_id, items_id )
order by clint_id, items_id
SQLFiddle demo
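As a cross-check of what the query computes on the sample rows, the conditional aggregation can be reproduced in a few lines. This is only an illustrative Python sketch, not Oracle code; summarize is a made-up name. One difference to note: where a customer/item pair has no returns the SQL yields NULL in the ret column, while this sketch reports 0.

```python
from collections import defaultdict

SALE, RETURN = 1, 2

# (Trans_Type, Items_ID, Quantity, Clint_ID), from the table in the question.
transactions = [
    (1, 701555, 3, 1), (2, 701555, 1, 1), (1, 701511, 63, 2),
    (2, 701555, 1, 1), (1, 701234, 2, 3), (1, 701234, 5, 3),
    (2, 701511, 1, 2), (1, 701522, 3, 2), (1, 701555, 2, 3),
    (1, 701555, 4, 2), (2, 701555, 2, 2),
]


def summarize(transactions):
    """Total sales, total returns and the rest per (Clint_ID, Items_ID)."""
    totals = defaultdict(lambda: {SALE: 0, RETURN: 0})
    for trans_type, items_id, quantity, clint_id in transactions:
        # Same idea as SUM(CASE WHEN trans_type = ... THEN quantity END).
        totals[(clint_id, items_id)][trans_type] += quantity
    return [
        (clint_id, items_id, t[SALE], t[RETURN], t[SALE] - t[RETURN])
        for (clint_id, items_id), t in sorted(totals.items())
    ]


summary = summarize(transactions)
for row in summary:
    print(row)
```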
I have records like these:
1 4 6 4 2 4 8
2 3 5 4 6 7 1
5 4 6 4 3 8 4
1 4 6 4 5 7 1
5 7 3 3 3 6 3
6 7 3 3 4 8 4
I want to sort them on columns 2, 3, 4 and 6 and, among the records that are identical in columns 2, 3 and 4, keep only the one with the biggest number in column 6, such as:
1 4 6 4 5 7 1
2 3 5 4 6 7 1
5 4 6 4 3 8 4
5 7 3 3 3 6 3
6 7 3 3 4 8 4
I have tried all kinds of combinations of sort and uniq, but everything fails because uniq cannot be applied to a specific column. The only thing I came up with is to first sort as above, then move columns 2, 3 and 4 to the end of each record, and then run uniq with -w so that it compares only those last 3 columns. This seems quite inefficient to me.
Thanks for your help!
You can achieve this with two passes of sort (assuming I understand your requirement correctly, since the desired output posted above does not match your description of it). The first pass sorts by fields 2 through 4 ascending and field 6 descending. The second pass sorts on fields 2 through 4 only, with the stable (-s) and unique (-u) flags added, so that for each combination of fields 2-4 only the first row, the one with the highest value in field 6, is kept. Each field gets its own numeric key, because a numeric key that spans several fields (-k2,4n) is compared on its leading number only, i.e. on field 2 alone:
sort -k2,2n -k3,3n -k4,4n -k6,6nr file.txt | sort -k2,2n -k3,3n -k4,4n -s -u
2 3 5 4 6 7 1
5 4 6 4 3 8 4
6 7 3 3 4 8 4
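The same selection can be written as a single pass that remembers the best record seen so far for each key. The following is a minimal Python sketch of that idea, offered only to make the logic explicit; keep_max_per_key is a made-up name and the records are the sample data from the question.

```python
# The sample records from the question, one tuple per line.
records = [
    (1, 4, 6, 4, 2, 4, 8),
    (2, 3, 5, 4, 6, 7, 1),
    (5, 4, 6, 4, 3, 8, 4),
    (1, 4, 6, 4, 5, 7, 1),
    (5, 7, 3, 3, 3, 6, 3),
    (6, 7, 3, 3, 4, 8, 4),
]


def keep_max_per_key(records):
    """For each (col2, col3, col4) keep the record with the largest col6."""
    best = {}
    for rec in records:
        key = rec[1:4]  # columns 2-4 (0-based indices 1..3)
        # Column 6 is index 5; on a tie the first record seen is kept.
        if key not in best or rec[5] > best[key][5]:
            best[key] = rec
    # Output ordered by the key, like the sorted result above.
    return [best[key] for key in sorted(best)]


kept = keep_max_per_key(records)
for rec in kept:
    print(*rec)
```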
I have a problem with a LINQ query in the following scenario:
I have two tables, Activity and ActivityTeacher, and a list of some teachers.
Activity Table
ActivityID Date Class
1 4/4/2012 1
2 4/5/2013 2
3 4/6/2013 5
4 5/6/2013 2
5 5/16/2013 1
6 5/20/2013 8
7 5/21/2013 7
8 6/22/2013 6
9 8/10/2013 5
10 8/12/2013 4
ActivityTeacher Table
ActivityID TeacherID
1 2
1 3
1 4
2 6
3 6
3 6
3 4
2 5
4 2
4 3
4 6
5 8
5 7
5 6
6 6
6 7
6 9
6 10
6 1
6 2
7 2
7 8
7 9
7 10
8 3
8 4
8 6
8 7
9 10
9 3
9 2
10 1
10 2
List of Teachers = {2, 3, 4}
Now I want to select records from Activity based on the list of teachers {2, 3, 4},
without using a foreach loop.
The Activity entity should have a Teachers navigation property you can utilize:
context.Activities
    .Where(x => x.Teachers.Any(t => listOfTeachers.Contains(t.TeacherId)));
If listOfTeachers contains the three IDs 2, 3, 4, this query should translate to SQL that is similar to the following:
select a.*
from Activity a
inner join ActivityTeacher at
on a.activityid = at.activityid
where at.teacherid in (2, 3, 4);
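To see which activities such a filter selects on the sample data, here is a small standalone sketch. It is plain Python, not LINQ or Entity Framework, and activities_with_any_teacher is a made-up name. It also counts the rows a plain inner join would return, since a join yields one row per matching (activity, teacher) pair, so an activity with several matching teachers shows up more than once.

```python
# (ActivityID, TeacherID), from the ActivityTeacher table in the question.
activity_teachers = [
    (1, 2), (1, 3), (1, 4), (2, 6), (3, 6), (3, 6), (3, 4), (2, 5),
    (4, 2), (4, 3), (4, 6), (5, 8), (5, 7), (5, 6), (6, 6), (6, 7),
    (6, 9), (6, 10), (6, 1), (6, 2), (7, 2), (7, 8), (7, 9), (7, 10),
    (8, 3), (8, 4), (8, 6), (8, 7), (9, 10), (9, 3), (9, 2), (10, 1),
    (10, 2),
]
list_of_teachers = {2, 3, 4}


def activities_with_any_teacher(pairs, teachers):
    """Ids of activities that have at least one teacher from the given set."""
    return sorted({activity for activity, teacher in pairs if teacher in teachers})


selected = activities_with_any_teacher(activity_teachers, list_of_teachers)
# Rows an inner join with "teacherid in (2, 3, 4)" would produce, duplicates included.
join_rows = [(a, t) for a, t in activity_teachers if t in list_of_teachers]

print(selected)
print(len(join_rows))
```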