Hive SELECT DISTINCT and GROUP BY in a subquery - hadoop

I am running a query but I'm a little stuck on the concept of subqueries in HiveQL. I am new to Hive and I've done a lot of reading, but I still can't get it to work.
I have a big table in which the fields I'm interested in are created_date and size. I basically want to compute the sum of the sizes of files created in a particular year, grouped by distinct year.
My current query:
SELECT year(created_date), SUM(size) FROM <tablename> GROUP BY created_date
2001 2654567
2001 231818
2001 1978222
2002 7625332
2002 6272829
2003 2733792
This gives me a list of all the years in the table and a sum for each, as above, but the years are duplicated. This is where I think I need a subquery to SELECT DISTINCT years and sum the total size for each.
Any help would be superb.

You might want to try GROUP BY the year, since that is what you are selecting:
SELECT year(created_date), SUM(size) FROM <tablename> GROUP BY year(created_date)
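If you want to keep the subquery structure you were reaching for, an equivalent (if more verbose) form computes the year in an inner query and aggregates over it in the outer one. A minimal sketch, assuming your table is named files:
SELECT yr, SUM(sz)
FROM (
  -- derive the year once in the inner query
  SELECT year(created_date) AS yr, size AS sz
  FROM files
) t
GROUP BY yr;
Either way you get one row per distinct year, so no separate SELECT DISTINCT is needed.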

Related

How can I rebuild an index in an IOT (index-organized table)?

Dear all experts,
I have an IOT with 7 million records in an Oracle database. An IOT is supposed to give fast access via the primary key, but in my case selecting the primary key column takes 4-5 seconds for a single row.
My query is:
Select Emp_Refno from Emp_master where Rownum = 1
order by Emp_Refno asc;
I have also used the SQL Tuning Advisor to optimize the query and applied the index it suggested, but the index does not appear in the explain plan and the query takes the same time as before.
I'm curious if the following query has the same execution time:
select * from (select Emp_Refno from Emp_master order by Emp_Refno asc) where rownum = 1
This is how I usually write top-n queries for Oracle.
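Note that in the original query the Rownum = 1 predicate is applied before the ORDER BY, so it returns an arbitrary row rather than the smallest Emp_Refno; the subquery form above avoids that. On Oracle 12c or later, the same top-n query can also be written with the row-limiting clause:
select Emp_Refno
from Emp_master
order by Emp_Refno asc
fetch first 1 row only;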

What's the best practice to filter out specific year in query in Netezza?

I am a SQL Server guy and just started working on Netezza. One thing that pops up for me is a daily query to find out the size of a table filtered by year: 2016, 2015, 2014, ...
What I am using now is something like below and it works for me, but I wonder if there is a better way to do it:
select count(1)
from table
where extract(year from datecolumn) = 2016
extract is a built-in function, but applying a function across a table with 10 billion+ rows would be unthinkable in SQL Server, to my knowledge.
Thank you for your advice.
The only problem I see with the query is the WHERE clause, which applies a function to the column side of the comparison. That effectively disables zone maps and thus forces Netezza to scan all data pages, not only those containing data from that year.
Instead write something like:
select count(1)
from table
where datecolumn between '2016-01-01' and '2016-12-31'
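One caveat, assuming datecolumn is a TIMESTAMP rather than a DATE: BETWEEN is inclusive on both ends, so rows with a time-of-day after midnight on December 31 would be missed. A half-open range avoids that edge case:
select count(1)
from table
where datecolumn >= '2016-01-01'
  and datecolumn < '2017-01-01'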
A more generic alternative is to create a 'date dimension table' with one row per day spanning your data (and a couple of years into the future).
This is an example for Postgres: https://medium.com/@duffn/creating-a-date-dimension-table-in-postgresql-af3f8e2941ac
This enables you to write code like this:
select count(1)
from table t join d_date d on t.datecolumn = d.date_actual
where d.year_actual = 2016
You may not have the generate_series() function on your system, but a 'select row_number()...' can do the same trick. A download is available here: https://www.ibm.com/developerworks/community/wikis/basic/anonymous/api/wiki/76c5f285-8577-4848-b1f3-167b8225e847/page/44d502dd-5a70-4db8-b8ee-6bbffcb32f00/attachment/6cb02340-a342-42e6-8953-aa01cbb10275/media/generate_series.tgz
A couple of further notes on 'date interval' WHERE clauses:
Those columns are the most likely candidates for zone-map optimization. Add an 'organize on (datecolumn)' at the bottom of your table DDL and reorganize the table. That will cause Netezza to move records around so that pages hold similar dates, and query times will improve.
Furthermore, if the table is big, you should ensure that the 'distribute on' clause results in an even distribution across data slices. The execution of a query will never be faster than the slowest data slice.
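A minimal DDL sketch of both clauses, assuming a hypothetical sales table (the column names are illustrative):
create table sales (
  id bigint,
  datecolumn date,
  amount numeric(18,2)
)
distribute on (id)        -- aim for an even spread across data slices
organize on (datecolumn); -- cluster pages by date so zone maps can prune

-- after loading (or after changing the organization), reorder the pages:
groom table sales records all;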
I hope this helps

MonetDB: Group by different parts of a timestamp

I have a timestamp column in a MonetDB table which I sometimes want to group by hour and sometimes by day or month. What is the most efficient way of doing this in MonetDB?
In, say, Postgres you could do something like:
select date_trunc('day', order_time), count(*)
from orders
group by date_trunc('day', order_time);
Which I appreciate would not use an index, but is there any way of doing this in MonetDB without creating additional date columns holding day-, month- and year-truncated values?
Thanks.
You could use EXTRACT(DAY FROM order_time), possibly as part of a subquery, before grouping.
It might be a little late for an answer, but the following should work for truncating to day precision:
SELECT CAST(order_time AS DATE) AS order_date, count(*)
FROM orders
GROUP BY order_date;
It works by casting the timestamp value to type DATE which is a MonetDB built-in type and the cast is pretty fast.
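Since the question also asks about hourly grouping, the same cast can be combined with EXTRACT. A sketch, assuming the orders table above:
SELECT CAST(order_time AS DATE) AS order_date,
       EXTRACT(HOUR FROM order_time) AS order_hour,
       count(*)
FROM orders
GROUP BY order_date, order_hour;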
It does not have the flexibility of date_trunc in Postgres, but if you need to go to monthly or yearly precision, you can use the somewhat slower but usable EXTRACT to get the relevant parts of the timestamp and group by them. For monthly grouping, you could do:
SELECT EXTRACT(YEAR FROM order_time) AS y,
EXTRACT(MONTH FROM order_time) AS m,
count(*)
FROM orders GROUP BY y, m;
The only disadvantage is that the date will be split across two columns.
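If a single column is preferable, one workaround (plain arithmetic, so it should behave the same in MonetDB) is to fold year and month into one integer, e.g. 201607 for July 2016:
SELECT EXTRACT(YEAR FROM order_time) * 100
       + EXTRACT(MONTH FROM order_time) AS ym,
       count(*)
FROM orders
GROUP BY ym;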

Explanation of an SQL query

I recently got some help with an Oracle query and don't quite understand how it works, so I can't get it to work with my data. Can anyone explain, in logical steps, what is happening and which variables are actually taken from an existing table's columns?
I am looking to select data from a table of readings (column names: day, hour, volume) and find the average volume for each hour of each day (thus GROUP BY day, hour), going back over all readings for that day/hour combination (as far back as my dataset goes) and writing out the average for it. Once that is done, the results are written to a different table with the same column names (day, hour, volume), except that when written back on a per-hour basis, 'volume' will be the average for that hour of that day. For example, I want to find the average for all Wednesdays at 7pm in the past and output that average as a new record.
Assuming these 3 columns are used, and in reference to the code below, I am not sure how "hours" differs from "hrs" and what the t1 variable represents. Any help is appreciated.
INSERT INTO avg_table (days, hours, avrg)
WITH xweek AS (
  SELECT ds, LPAD(hrs, 2, '0') hrs
  FROM (SELECT LEVEL ds
        FROM DUAL
        CONNECT BY LEVEL <= 7),
       (SELECT LEVEL - 1 hrs
        FROM DUAL
        CONNECT BY LEVEL <= 24))
SELECT t1.ds, t1.hrs, AVG(volume)
FROM xweek t1, tables t2
WHERE t1.ds = TO_CHAR(t2.day(+), 'D')
  AND t1.hrs = t2.hour(+)
GROUP BY t1.ds, t1.hrs;
I'd re-write this slightly so it makes more sense (to me at least).
To break it down bit by bit, CONNECT BY is a hierarchical (recursive) query. This is a common "cheat" to generate rows; in this case 7 rows to represent the days of the week, numbered 1 to 7.
SELECT LEVEL ds
FROM DUAL
CONNECT BY LEVEL <= 7
The next one generates the hours 0 to 23 to represent midnight to 11pm. These two row sources are then joined together in the old style as a Cartesian or CROSS JOIN. This means that every possible combination of rows is returned, i.e. it generates every hour of every day for a single week (7 × 24 = 168 rows).
The WITH clause is described in the documentation on the SELECT statement, it is commonly known as a Common Table Expression (CTE), or in Oracle the Subquery Factoring Clause. This enables you to assign a name to a sub-query and reference that single sub-query in multiple places. It can also be used to keep code clean or generate temporary tables in memory for ready access. It's not required in this case but it does help to separate the code nicely.
Lastly, the + is Oracle's old notation for outer joins. The two notations are mostly equivalent, but there are a few very small differences that are described in this question and answer.
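As a quick illustration of that notation (tables a and b here are hypothetical), the following two queries are equivalent:
-- old Oracle (+) notation: all rows from a, matching rows from b
SELECT a.id, b.val FROM a, b WHERE a.id = b.a_id(+);
-- ANSI equivalent
SELECT a.id, b.val FROM a LEFT OUTER JOIN b ON a.id = b.a_id;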
As I said at the beginning, I would re-write this to conform to the ANSI standard because I find it more readable:
insert into avg_table (days, hours, avrg)
with xweek as (
select ds, lpad(hrs, 2, '0') hrs
from ( select level ds
from dual
connect by level <= 7 )
cross join ( select level - 1 hrs
from dual
connect by level <= 24 )
)
select t1.ds, t1.hrs, avg(volume)
from xweek t1
left outer join tables t2
on t1.ds = to_char(t2.day, 'd')
and t1.hrs = t2.hour
group by t1.ds, t1.hrs;
To go into slightly more detail: the t1 variable is an alias for the CTE xweek, so you don't have to type the entire name each time. hrs is an alias for the generated expression; as you reference it explicitly you need to call it something. hours is a column of your own table (the insert target avg_table).
As to whether this is doing the correct thing, I'm not sure; you imply you only want it for a single day rather than the entire week, so only you can decide. I also find it a little strange that you need the hours column in your table to be a character string left-padded with 0s, lpad(hrs, 2, '0'); once again, only you know if this is correct.
I would highly recommend playing around with this yourself and working out how everything fits together. You also seem to be missing some of the basics; get a textbook or look around on the internet, or Stack Overflow, there are plenty of examples.

Joining two tables in Hive

I have a table 'ABC' where the data is partitioned by year, month, and day, e.g.:
(year='2011', month='08', day='01')
I want to run a query something like
select * from ABC where dt>='2011-03-01' and dt<='2012-02-01';
How can I run this query against the above year/month/day partitioning scheme?
You might consider creating an external table that is partitioned by 'yyyy-mm-dd', and uses the same locations as your existing table. You won't have to copy any data, and you'll have the flexibility of both partitioning formats.
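A sketch of that approach (the table name, location, and column list below are assumptions): create an external table partitioned by a single dt string and register each existing year/month/day directory as a dt partition:
CREATE EXTERNAL TABLE abc_by_dt (
  -- same data columns as ABC; illustrative only
  id BIGINT,
  payload STRING
)
PARTITIONED BY (dt STRING)
LOCATION '/warehouse/abc';

-- one ALTER per existing directory, e.g.:
ALTER TABLE abc_by_dt ADD PARTITION (dt='2011-08-01')
  LOCATION '/warehouse/abc/year=2011/month=08/day=01';

Range queries such as select * from abc_by_dt where dt >= '2011-03-01' and dt <= '2012-02-01' then prune partitions directly.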
select * from ABC where year='2011' and month >= '03'
UNION
select * from ABC where year='2012' and month = '01'
UNION
select * from ABC where year='2012' and month='02' and day='01';
The above query should serve the purpose, but it's neither flexible nor very readable. As Matt suggested, a better partitioning format would be a single string column in yyyy-MM-dd format as the partitioning column. You might have to make a copy of the data to change the partitioning scheme from year, month, day to dt, but in my opinion it's totally worth it.
