Hive - How to display 0 for partitions that are empty? - hadoop

There is 1 partition for date in my table. For that partition, there are 2 folders in hdfs, e.g. 01-01-2018 and 02-01-2018. In 02-01-2018 there is no data at all.
So when I query like below:
select date, count(*) from table1 group by date
It will only return the output for partition with data like below:
01-01-2018 168
I need the output to be like below:
01-01-2018 168
02-01-2018 0
How to achieve that output? Thank you.

Related

Power Bi count rows for all tables in one measure

In my power Bi I would like to count rows for all my tables and having this output:
Table Name
Row count
Table1
126
Table2
985
Table3
998
...
...
As long as I have few tables I can do
NEWTABLE = UNION(
ROW("TableName","Table1", "Rowcount",ROWSCOUNT(Table1)),
ROW("TableName","Table2", "Rowcount",ROWSCOUNT(Table2)),
...
)
But this starts to be complicated when I have many tables.
Is there a way I can do it? Like a loop or something?
Thank you
If you only need a metrics then you can use DaxStudio -> ViewMetrics
where cardinality is your "rowCounts"
If you need something more, then you can get all table name from DMV
select * from $SYSTEM.TMSCHEMA_TABLES
populate this as another table in your model, and use M language to loop through.
here useful example:
https://community.powerbi.com/t5/Power-Query/Power-query-Counting-rows-from-all-table-in-query-editor-but-not/td-p/1198489

Select single random sample from group by in Hive

I have a table that looks like so:
Name Age Num_Hobbies Num Shoes
Jane 31 10 2
Bob 23 3 4
Jane 60 2 200
Jane 31 100 6
Bob 10 8 7
etc etc
I would like to group this table by Name and Age, and at random pick one row from the rest of the columns.
In pandas, I would do the following:
df.groupby(['Name', 'Age']).apply(lambda x: x.sample(n=1))
In hive, I know how to create the group, but not how to choose a single random sample from group.
I saw this question on stack overflow: How to sample for each group in hive?
However, I do not understand how to apply Dynamic partitions or Hive bucketing to select a single sample from a group.
You can use rank() or row_number() with rand()
select * from
(
select name,age,rank() (partition by name,age order by rand()) as rank
from table
) t
where rank = 1

convert string of a column to multiple rows

For data like below
Col1
----
1
23
34
124
Output should be like below
Out
1
2
3
4
I tried the below hierarchical query but its giving repeated data
select substr(col1, level, 1)
from table1
connect by level <= length(col1);
I can't use distinct as this is sample and main table where I have to use this query has quite large data.
Thanks

Tune oracle query with groupby clause

I have a table with Lots of cost columns for each Key
TableA
SK1 SK2 Col1 Col2 Col3..... Col50 Flg(Y/N)
1 2 10 20 30 ...... 500 Y
1 2 10 20 30 ...... 500 N
2 2 10 20 30 ...... 500 N
I need to aggregate(sum) of all values and then check if there are any values with Y then add them to new tableB.
Here table A record combination (1,2) for (sk1,sk2) should be returned.
The i have written query is to select lisr of all cols and add as group by.
We have lots of data so this query is taking too long to run. Any chance to relook into this and do so that it can become faster.
select
Sk1,
Sk2,
nvl(sum(col3),0),
nvl(sum(col4))0,
.....
nvl(sum(col50))
from table A
group by Sk1,
Sk2
Iam using this as part of large query where in many other calculations are performed on top of this.
Working out whether any of a grouped set of records contains a 'Y' would be as simple as ...
select ...
from ...
group by ...
having max(flg) = 'Y'
For now i have created a temporary table and have loaded all the data into it.
If you are using this as part of large query, did you try WITH option?
It could be like this
WITH SUM_DATA AS (select col1, col2, nvl(sum(col3),0), nvl(sum(col4))0, ..... nvl(sum(col50)) from table A group by col1, col2)
SELECT xyz
FROM abc, sum_data
WHERE abc.join_col = sum_data.join_col
More help here

Set variables in HIVE query

I am trying to follow the post here to set a variable in my Hive query. Assuming I've the following file in hdfs:
/home/hduser/test/hr.txt
Berg,12000
Faviet,9000
Chen,8200
Urman,7800
Sciarra,7700
Popp,6900
Paino,8790
I then created my schema on top of the data as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS employees (lname STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/home/hduser/test/';
I want to create 4 tiles for the table but I don't want to hardcode the number of tiles and instead want to pass it in as a variable. My code is below:
SET q1=select ceiling(count(*)/2) from employees;
SELECT lname,
salary,
NTILE(${hiveconf:q1}) OVER (
ORDER BY salary DESC) AS quartile
FROM employees;
However, this throws an error:
FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies.
Underlying error: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: Number of tiles must be an int expression
I tried to use quotes when calling the variable, as in '${hiveconf:q1}', but that didn't seem to help. If I hardcode the number of tiles (which I am trying to avoid), the workflow will go something like this:
SELECT lname,
salary,
NTILE(4) OVER (
ORDER BY salary DESC) AS quartile
FROM employees;
which yields
Berg 12000 1
Faviet 9000 1
Paino 8790 2
Chen 8200 2
Urman 7800 3
Sciarra 7700 3
Popp 6900 4
Thoughts?
When there isn't a documented way one can use documented features to provide a clean enough hack :)
Here's my attempt, using dfs commands from hive, shell commands from hive, the source-command and what not. I guess it might not work out of the box with queries through Hiveserver2. I would be glad if there were a prettier way
Let's go
Basic setup
SET EMPLOYEE_TABLE_LOCATION=/home/hduser/test/;
CREATE EXTERNAL TABLE IF NOT EXISTS employees (lname STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '${hiveconf:EMPLOYEE_TABLE_LOCATION}';
SET PATH_TO_SETTINGS_FILE=hdfs:/tmp/query_to_setting;
SET FILENAME_ON_LOCAL_FS=query_to_setting.sql;
Generate a file in hdfs
with content "SET q1=<the-query-result>;"
CREATE TABLE query_to_setting_table
LOCATION '${hiveconf:PATH_TO_SETTINGS_FILE}'
AS
SELECT concat('SET q1=', ceiling(count(*)/2),'\;') from employees;
Source in the generated file as any sql-file.
First put the file to local fs since 'source' only operates on local disk...
dfs -get ${hiveconf:PATH_TO_SETTINGS_FILE}/000000_0 ${hiveconf:FILENAME_ON_LOCAL_FS};
source ${hiveconf:FILENAME_ON_LOCAL_FS};
Try the setting
hive> SET q1;
q1=4
Use the setting in a query
hive > SELECT lname,
salary,
NTILE( ${hiveconf:q1}) OVER (
ORDER BY salary DESC) AS quartile
FROM employees;
OK
Berg 12000 1
Faviet 9000 1
Paino 8790 2
Chen 8200 2
Urman 7800 3
Sciarra 7700 3
Popp 6900 4
Optional cleanup
!rm ${hiveconf:FILENAME_ON_LOCAL_FS};
DROP TABLE query_to_setting_table;

Resources