Hive - hourly window averages - hadoop

I have data like this in a Hive table:
+-------------------+-------+---------+--------+
| _c0 | name | value0 | value1 |
+-------------------+-------+---------+--------+
| 2015-10-07 13:01 | john | 10.0 | 100 |
| 2015-10-07 13:20 | john | 20.0 | 200 |
| 2015-10-07 13:41 | john | 15.0 | 300 |
| 2015-10-07 14:00 | john | 30.0 | 300 |
| 2015-10-07 14:20 | john | 60.0 | 200 |
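| 2015-10-07 14:40 | john | 30.0 | 400 |
+-------------------+-------+---------+--------+
I need to get hourly averages:
+-------------------+-------+---------+--------+
| 2015-10-07 13:00 | john | 15.0 | 200 |
| 2015-10-07 14:00 | john | 40.0 | 300 |
+-------------------+-------+---------+--------+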
I have an idea of how to do this with a partition/over clause in psql, but I'm not sure how to do it in Hive. One idea would be to split the datetime into date and hour (e.g. "2015-10-07 13") and use a group by with the avg function, but that is probably not the best way.
Any ideas?

Do it the way you suggested. If you just want the average by date and hour (and probably name), partitioning and an over clause are not necessary; a plain group by is enough.
Query:
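-- `date` is a reserved keyword in recent Hive versions, so the aliases are dt and hr.
-- `_c0` is backtick-quoted because of its leading underscore.
select dt, hr, name, avg(value0) as avg0, avg(value1) as avg1
from (
  select split(`_c0`, ' ')[0] as dt                     -- e.g. '2015-10-07'
       , split(split(`_c0`, ' ')[1], '\\:')[0] as hr    -- e.g. '13'
       , name
       , value0
       , value1
  from db.table
) x
group by dt, hr, name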
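The grouping logic described above (cut the timestamp down to date and hour, then average per bucket) can be sketched outside Hive. The following Python is only an illustration using the six sample rows from the question; the function name hourly_averages is made up for this example and is not part of Hive.

```python
from collections import defaultdict

# Sample rows copied from the question: (timestamp, name, value0, value1).
ROWS = [
    ("2015-10-07 13:01", "john", 10.0, 100),
    ("2015-10-07 13:20", "john", 20.0, 200),
    ("2015-10-07 13:41", "john", 15.0, 300),
    ("2015-10-07 14:00", "john", 30.0, 300),
    ("2015-10-07 14:20", "john", 60.0, 200),
    ("2015-10-07 14:40", "john", 30.0, 400),
]


def hourly_averages(rows):
    """Average value0 and value1 per (date, hour, name) bucket."""
    buckets = defaultdict(list)
    for ts, name, value0, value1 in rows:
        date, time = ts.split(" ")      # same idea as split(_c0, ' ')
        hour = time.split(":")[0]       # same idea as split(..., ':')[0]
        buckets[(date, hour, name)].append((value0, value1))

    result = {}
    for key, values in buckets.items():
        n = len(values)
        result[key] = (
            sum(v[0] for v in values) / n,
            sum(v[1] for v in values) / n,
        )
    return result


averages = hourly_averages(ROWS)
for (date, hour, name), (avg0, avg1) in sorted(averages.items()):
    print(f"{date} {hour}:00 | {name} | {avg0} | {avg1}")
```

For the sample data this gives the two hourly rows the question asks for: 15.0 and 200.0 for the 13:00 bucket, 40.0 and 300.0 for the 14:00 bucket.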

Related

Distinct aggregation in pre-calculated measure (MDX)

There are two measures in one fact table / dimension. The measure 'YearTotal' should somehow be pre-calculated as a distinct value before any further summing (aggregating). 'YearTotal' can't be derived from the 'YearDetail' measure, so the two are completely independent.
+--------------+---------------+-----------+------------+
| AccountingID | Date | TotalYear | YearDetail |
+--------------+---------------+-----------+------------+
| account1 | 31.12.2012 | 500 | 7 |
| account1 | 31.12.2012 | 500 | 3 |
| account1 | 31.12.2012 | 500 | 1 |
| account2 | 31.12.2012 | 900 | 53 |
| account2 | 31.12.2012 | 900 | 4 |
| account2 | 31.12.2012 | 900 | 9 |
| account3 | 31.12.2012 | 203 | 25 |
| account3 | 31.12.2012 | 203 | 11 |
| account3 | 31.12.2012 | 203 | 17 |
+--------------+---------------+-----------+------------+
So, the question: what should the (pre)calculated measure expression be to get the right result from a query like this:
select
(
[Accounting Dim].[Account ID],
[Dim Date].[Calendar Year].&[2012]
) ON COLUMNS
from [Cube]
WHERE [Measures].[YearTotal]
With a correct expression the answer would be (500+900+203) = 1603.
(Optionally:) maybe there is a common "distinct" pattern that also works for other simple types of aggregation.
Maybe go with MAX at the column level [TotalYear] and enforce a specific level of calculation.
CREATE MEMBER [Measures].[YearTotal] AS
MAX(
(
[Calendar Dim].[Year].[All].Children,
[Accounting Dim].[Account ID].[All].Children
),
[Measures].[TotalYear]
)
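The arithmetic the question asks for (take TotalYear once per account, then add across accounts) can be checked outside the cube. This Python sketch is not MDX and makes no claim about how the calculated member above is evaluated; it only reproduces the expected number from the sample rows. The function name distinct_year_total is hypothetical.

```python
# Sample rows from the question: (AccountingID, TotalYear, YearDetail).
ROWS = [
    ("account1", 500, 7), ("account1", 500, 3), ("account1", 500, 1),
    ("account2", 900, 53), ("account2", 900, 4), ("account2", 900, 9),
    ("account3", 203, 25), ("account3", 203, 11), ("account3", 203, 17),
]


def distinct_year_total(rows):
    """Take TotalYear once per account (MAX within the account), then sum."""
    per_account = {}
    for account, total_year, _detail in rows:
        current = per_account.get(account, total_year)
        per_account[account] = max(current, total_year)
    return sum(per_account.values())


year_total = distinct_year_total(ROWS)

# A plain sum over all fact rows counts each account's total three times.
naive_total = sum(total_year for _account, total_year, _detail in ROWS)

print(year_total, naive_total)
```

On the sample data year_total is 1603, while the plain row-level sum is 4809, which is why the measure has to be fixed at the account level before it is summed.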

DAX query with multiple filters in Power BI

I have two tables, 'locations' and 'markets', with a many-to-many relationship between them on the column 'market_id'. A report-level filter has been applied on the column 'entity' from the 'locations' table. Now I need a distinct count of 'location_id' from the 'markets' table where 'active = TRUE'. How can I write a DAX query such that the distinct count of location_id changes dynamically with the selection made in the report-level filter?
Below is an example of the tables:
locations:
| location_id | market_id | entity | active |
|-------------|-----------|--------|--------|
| 1 | 10 | nyc | true |
| 2 | 20 | alaska | true |
| 2 | 20 | alaska | true |
| 2 | 30 | miami | false |
| 3 | 40 | dallas | true |
markets:
| location_id | market_id | active |
|-------------|-----------|--------|
| 2 | 20 | true |
| 2 | 20 | true |
| 5 | 20 | true |
| 6 | 20 | false |
I'm fairly new to Power BI. Any help will be appreciated.
Here you go:
DistinctLocations = CALCULATE(DISTINCTCOUNT(markets[location_id]), markets[active] = TRUE())

How to skip the "-- More --" in PostgreSQL pager?

When PostgreSQL prints a long output (e.g. SELECT * FROM table_name for a table with 1000 rows), it only shows the first 50 or so lines.
There's a "-- More --" line at the bottom.
If you press Enter, it shows the next line.
How can I skip the rest of a long output?
This is PostgreSQL 12 running in cmd on Windows 10.
I tried:
\q on the "-- More --" line, then Enter: it just shows the next line
\pset pager off is my current solution, although it is not ideal for showing a table with 1000 rows
dbname=# SELECT * FROM table_name;
id | first_name | last_name | email | gender | date_of_birth | country_of_birth
------+----------------+-------------------+---------------------------------------+--------+---------------+----------------------------------
1 | Erroll | Craisford | xxx#yyy.zzz | Male | 2019-05-28 | Indonesia
2 | Son | Smitherman | xxx#yyy.zzz | Male | 2019-02-16 | Indonesia
3 | Dion | Primo | xxx#yyy.zzz | Female | 2018-12-14 | Thailand
4 | Florette | Waldock | | Female | 2019-05-23 | Palestinian Territory
5 | Roderick | Stowte | xxx#yyy.zzz | Male | 2019-01-24 | Poland
6 | Hi | Kleeman | xxx#yyy.zzz | Male | 2019-01-26 | Indonesia
7 | Ethelind | Gard | xxx#yyy.zzz | Female | 2018-11-05 | France
8 | Bartel | Melhuish | | Male | 2019-02-18 | Vietnam
9 | Smith | Gavahan | xxx#yyy.zzz | Male | 2019-05-04 | Sweden
10 | Harmonia | Defrain | xxx#yyy.zzz | Female | 2018-12-17 | France
11 | Eulalie | Cuerdale | xxx#yyy.zzz | Female | 2019-05-09 | Angola
12 | Floria | Bernette | xxx#yyy.zzz | Female | 2019-07-07 | China
13 | Ruddy | Scargle | xxx#yyy.zzz | Male | 2019-08-27 | Norway
14 | Vinson | Capewell | xxx#yyy.zzz | Male | 2019-01-24 | Portugal
15 | Eben | Yellep | xxx#yyy.zzz | Male | 2019-03-12 | Mexico
16 | Yolande | Blaasch | xxx#yyy.zzz | Female | 2019-01-22 | Philippines
17 | Tiphani | Whitlow | xxx#yyy.zzz | Female | 2019-01-01 | New Zealand
18 | Alvina | Carne | xxx#yyy.zzz | Female | 2019-03-01 | Peru
19 | Peg | Hains | xxx#yyy.zzz | Female | 2019-02-22 | Indonesia
20 | Arlana | Sibson | xxx#yyy.zzz | Female | 2019-06-15 | Niger
21 | Rabi | Slimme | xxx#yyy.zzz | Male | 2019-03-03 | Belarus
22 | Marianna | Gouthier | | Female | 2019-05-06 | Sweden
-- More --
Just press q.
For more options see man page http://man7.org/linux/man-pages/man1/more.1.html#COMMANDS

View count rows as columns in query result

First things first: I am able to get the data one way. My goal is to make the query result more readable, and I am asking whether that is possible.
I have a table that is fed by devices. I want to get the number of data records sent in each hour, grouped by two identifier columns. Grouping by these two columns is needed to identify one device type.
Table structure is like:
| identifier-1 | identifier-2 | day | hour | data_name | data_value |
|--------------|--------------|------------|------|-----------|------------|
| type_1 | subType_4 | 2016-08-25 | 0 | Key-30 | 4342 |
|--------------|--------------|------------|------|-----------|------------|
| type_3 | subType_2 | 2016-08-25 | 0 | Key-50 | 96 |
|--------------|--------------|------------|------|-----------|------------|
| type_6 | subType_2 | 2016-08-25 | 1 | Key-44 | 324 |
|--------------|--------------|------------|------|-----------|------------|
| type_2 | subType_1 | 2016-08-25 | 1 | Key-26 | 225 |
|--------------|--------------|------------|------|-----------|------------|
I'm going to use one specific data_name that is sent by all devices; counting the rows with this data_name gives the number of records sent in each hour. It is possible to get these numbers as 24 rows by grouping by identifier-1, identifier-2, day and hour. However, those rows repeat for each device type.
| identifier-1 | identifier-2 | day | hour | count |
|--------------|--------------|------------|------|-------|
| type_6 | subType_2 | 2016-08-25 | 0 | 340 |
|--------------|--------------|------------|------|-------|
| type_6 | subType_2 | 2016-08-25 | 1 | 340 |
|--------------|--------------|------------|------|-------|
|--------------|--------------|------------|------|-------|
| type_1 | subType_4 | 2016-08-25 | 0 | 32 |
|--------------|--------------|------------|------|-------|
| type_1 | subType_4 | 2016-08-25 | 1 | 30 |
|--------------|--------------|------------|------|-------|
|--------------|--------------|------------|------|-------|
|--------------|--------------|------------|------|-------|
I want to view the result like this:
| identifier-1 | identifier-2 | day | count_of_0 | count_of_1 |
|--------------|--------------|------------|------------|------------|
| type_6 | subType_2 | 2016-08-25 | 340 | 340 |
|--------------|--------------|------------|------------|------------|
| type_1 | subType_4 | 2016-08-25 | 32 | 30 |
|--------------|--------------|------------|------------|------------|
|--------------|--------------|------------|------------|------------|
In SQL it is possible to use subqueries as columns in the result, but that is not possible in Hive. I guess these are called correlated subqueries.
The answer to the question "Hive column as a subquery select" did not work for me.
Do you have any ideas or suggestions?
You can do this using conditional aggregation:
select identifier1, identifier2, day,
-- add 1 per matching row to get a count; use data_value instead of 1 to sum the values
sum(case when hour = 0 then 1 else 0 end) as cnt_0,
sum(case when hour = 1 then 1 else 0 end) as cnt_1
from t
where data_name = ??
group by identifier1, identifier2, day
order by identifier1, identifier2, day
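A small runnable sketch of conditional aggregation, using SQLite through Python's sqlite3 module as a stand-in for Hive. It counts rows per hour (adding 1 for each matching row), since the question asks for the number of records sent. The table contents and the data_name value 'heartbeat' are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (identifier1 TEXT, identifier2 TEXT, day TEXT, "
    "hour INTEGER, data_name TEXT, data_value INTEGER)"
)

# Hypothetical rows; 'heartbeat' stands in for the data_name sent by all devices.
rows = [
    ("type_6", "subType_2", "2016-08-25", 0, "heartbeat", 324),
    ("type_6", "subType_2", "2016-08-25", 0, "heartbeat", 96),
    ("type_6", "subType_2", "2016-08-25", 1, "heartbeat", 225),
    ("type_1", "subType_4", "2016-08-25", 0, "heartbeat", 4342),
    ("type_1", "subType_4", "2016-08-25", 1, "heartbeat", 17),
    ("type_1", "subType_4", "2016-08-25", 1, "heartbeat", 8),
    ("type_1", "subType_4", "2016-08-25", 1, "other", 1),  # excluded by the WHERE clause
]
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?, ?)", rows)

# One output row per device type and day, one column per hour.
pivot = conn.execute(
    """
    SELECT identifier1, identifier2, day,
           SUM(CASE WHEN hour = 0 THEN 1 ELSE 0 END) AS count_of_0,
           SUM(CASE WHEN hour = 1 THEN 1 ELSE 0 END) AS count_of_1
    FROM t
    WHERE data_name = ?
    GROUP BY identifier1, identifier2, day
    ORDER BY identifier1, identifier2, day
    """,
    ("heartbeat",),
).fetchall()

for row in pivot:
    print(row)
```

A full day needs 24 such CASE expressions, one per hour; they are usually generated rather than typed by hand.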

MySQL equivalent of Oracle's rank()

Oracle has two functions, rank() and dense_rank(), which I've found very useful for some applications. I am doing something in MySQL now and was wondering whether it has something equivalent.
Before MySQL 8.0 there is nothing directly equivalent (8.0 added window functions, including RANK() and DENSE_RANK()), but you can fake it with some (not terribly efficient) self-joins. Some sample code from a collection of MySQL query how-tos:
SELECT v1.name, v1.votes, COUNT(v2.votes) AS Rank
FROM votes v1
JOIN votes v2 ON v1.votes < v2.votes OR (v1.votes=v2.votes and v1.name = v2.name)
GROUP BY v1.name, v1.votes
ORDER BY v1.votes DESC, v1.name DESC;
+-------+-------+------+
| name | votes | Rank |
+-------+-------+------+
| Green | 50 | 1 |
| Black | 40 | 2 |
| White | 20 | 3 |
| Brown | 20 | 3 |
| Jones | 15 | 5 |
| Smith | 10 | 6 |
+-------+-------+------+
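The self-join can be tried without a MySQL server. The sketch below runs the same join in SQLite through Python's sqlite3 module, with the votes table filled from the result shown above; the alias rnk replaces Rank only to stay clear of keyword clashes and is otherwise an arbitrary choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE votes (name TEXT, votes INTEGER)")
conn.executemany(
    "INSERT INTO votes VALUES (?, ?)",
    [("Green", 50), ("Black", 40), ("White", 20),
     ("Brown", 20), ("Jones", 15), ("Smith", 10)],
)

# Each row is joined to every row with strictly more votes, plus itself.
# Counting the matches gives 1 + (number of rows ahead), i.e. the rank.
ranked = conn.execute(
    """
    SELECT v1.name, v1.votes, COUNT(v2.votes) AS rnk
    FROM votes v1
    JOIN votes v2
      ON v1.votes < v2.votes
      OR (v1.votes = v2.votes AND v1.name = v2.name)
    GROUP BY v1.name, v1.votes
    ORDER BY v1.votes DESC, v1.name DESC
    """
).fetchall()

for row in ranked:
    print(row)
```

Tied rows (White and Brown) see the same set of higher-voted rows, so they share rank 3, and the next row gets rank 5.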
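How about this implementation in MySQL? Note that, as written, it behaves like rank() rather than dense_rank(): tied rows share a rank and the following rank is skipped. Using group_concat(distinct age order by age desc) in the subquery gives dense_rank() behaviour instead.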
CREATE TABLE `person` (
`id` int(11) DEFAULT NULL,
`first_name` varchar(20) DEFAULT NULL,
`age` int(11) DEFAULT NULL,
`gender` char(1) DEFAULT NULL);
INSERT INTO `person` VALUES
(1,'Bob',25,'M'),
(2,'Jane',20,'F'),
(3,'Jack',30,'M'),
(4,'Bill',32,'M'),
(5,'Nick',22,'M'),
(6,'Kathy',18,'F'),
(7,'Steve',36,'M'),
(8,'Anne',25,'F'),
(9,'Mike',25,'M');
The data before ranking looks like this:
mysql> select * from person;
+------+------------+------+--------+
| id | first_name | age | gender |
+------+------------+------+--------+
| 1 | Bob | 25 | M |
| 2 | Jane | 20 | F |
| 3 | Jack | 30 | M |
| 4 | Bill | 32 | M |
| 5 | Nick | 22 | M |
| 6 | Kathy | 18 | F |
| 7 | Steve | 36 | M |
| 8 | Anne | 25 | F |
| 9 | Mike | 25 | M |
+------+------------+------+--------+
9 rows in set (0.00 sec)
The data after ranking looks like this, including the equivalent of "partition by" on gender:
+------------+--------+------+------+
| first_name | gender | age | rank |
+------------+--------+------+------+
| Anne | F | 25 | 1 |
| Jane | F | 20 | 2 |
| Kathy | F | 18 | 3 |
| Steve | M | 36 | 1 |
| Bill | M | 32 | 2 |
| Jack | M | 30 | 3 |
| Mike | M | 25 | 4 |
| Bob | M | 25 | 4 |
| Nick | M | 22 | 6 |
+------------+--------+------+------+
9 rows in set (0.00 sec)
The query is:
select first_name,t1.gender,age,FIND_IN_SET(age,t1.age_set) as rank from person t2,
(select gender,group_concat(age order by age desc) as age_set from person group by gender) t1
where t1.gender=t2.gender
order by t1.gender,rank
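To see why the position in the concatenated age list acts as a rank, here is a Python sketch of the same idea on the person rows above. The helper find_in_set only imitates the basic behaviour of MySQL's FIND_IN_SET (1-based position of the first match, 0 if absent), and the distinct switch is an assumption added for illustration: it corresponds to putting DISTINCT inside group_concat.

```python
from collections import defaultdict

# Rows from the INSERT statement above: (id, first_name, age, gender).
PEOPLE = [
    (1, "Bob", 25, "M"), (2, "Jane", 20, "F"), (3, "Jack", 30, "M"),
    (4, "Bill", 32, "M"), (5, "Nick", 22, "M"), (6, "Kathy", 18, "F"),
    (7, "Steve", 36, "M"), (8, "Anne", 25, "F"), (9, "Mike", 25, "M"),
]


def find_in_set(value, csv):
    """Imitate FIND_IN_SET: 1-based position of the first match, 0 if absent."""
    items = csv.split(",")
    return items.index(value) + 1 if value in items else 0


def ranks_by_gender(people, distinct):
    """Rank each person by age (descending) within their gender."""
    ages = defaultdict(list)
    for _id, _name, age, gender in people:
        ages[gender].append(age)

    # Build the per-gender list that group_concat(age order by age desc) would produce.
    age_set = {}
    for gender, values in ages.items():
        if distinct:
            values = set(values)
        age_set[gender] = ",".join(str(a) for a in sorted(values, reverse=True))

    return {
        name: find_in_set(str(age), age_set[gender])
        for _id, name, age, gender in people
    }


rank = ranks_by_gender(PEOPLE, distinct=False)        # like rank()
dense_rank = ranks_by_gender(PEOPLE, distinct=True)   # like dense_rank()
print(rank["Nick"], dense_rank["Nick"])
```

With duplicates kept, the two 25-year-old men both land on position 4 and Nick on position 6, matching the table above; with duplicates removed Nick moves up to 5, which is what dense_rank() would return.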
