How to count the number of words in each column delimited by a "|" separator using Hive? - hadoop

The input data is:
+----------------------+--------------------------------+
| movie_name | Genres |
+----------------------+--------------------------------+
| digimon | Adventure|Animation|Children's |
| Slumber_Party_Massac | Horror |
+----------------------+--------------------------------+
I need output like:
+----------------------+--------------------------------+-----------------+
| movie_name | Genres | count_of_genres |
+----------------------+--------------------------------+-----------------+
| digimon | Adventure|Animation|Children's | 3 |
| Slumber_Party_Massac | Horror | 1 |
+----------------------+--------------------------------+-----------------+

select *,
       size(split(coalesce(Genres, ''), '[^|\\s]+')) - 1 as count_of_genres
from mytable;
This solution covers several edge cases, including:
NULL values
Empty strings
Empty tokens (e.g. Adventure||Animation or Adventure| |Animation)

This is a really, really bad way to store data. You should have a separate MovieGenres table with one row per movie and per genre.
One method is to use length() and replace():
select t.*,
(1 + length(genres) - length(replace(genres, '|', ''))) as num_genres
from t;
This assumes that each movie has at least one genre. If not, you need to test for that as well.
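If you take the length()/replace() route, a null-safe variant is simple. A minimal sketch, assuming the table and column names from the question (note: if your Hive version lacks replace(), regexp_replace(genres, '\\|', '') works the same here):
select t.*,
       case
           when genres is null or genres = '' then 0
           else 1 + length(genres) - length(replace(genres, '|', ''))
       end as count_of_genres
from mytable t;
Unlike the regex-based query above, this counts empty tokens (e.g. Adventure||Animation yields 3), so pick the approach that matches your data.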

Related

Automatically generating documentation about the structure of the database

There is a database that contains several views and tables.
I need to create a report (documentation of the database) listing all the fields in these tables, indicating each field's type and, if possible, its minimum/maximum values and the value from the first row. For example:
.------------.--------.--------.--------------.--------------.--------------.
| Table name | Column | Type | MinValue | MaxValue | FirstRow |
:------------+--------+--------+--------------+--------------+--------------:
| Table1 | day | date | ‘2010-09-17’ | ‘2016-12-10’ | ‘2016-12-10’ |
:------------+--------+--------+--------------+--------------+--------------:
| Table1 | price | double | 1030.8 | 29485.7 | 6023.8 |
:------------+--------+--------+--------------+--------------+--------------:
| … | | | | | |
:------------+--------+--------+--------------+--------------+--------------:
| TableN | day | date | ‘2014-06-20’ | ‘2016-11-28’ | ‘2016-11-16’ |
:------------+--------+--------+--------------+--------------+--------------:
| TableN | owner | string | NULL | NULL | ‘Joe’ |
'------------'--------'--------'--------------'--------------'--------------'
I think executing many queries like
SELECT MAX(column_name) AS max_value, MIN(column_name) AS min_value
FROM table_name
will be inefficient on the huge tables stored in Hadoop.
After reading the documentation I found an article about "Statistics in Hive".
It seems I must use a statement like this:
ANALYZE TABLE tablename COMPUTE STATISTICS FOR COLUMNS;
But this command fails with an error:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.ColumnStatsTask
Do I understand correctly that this statement adds information to the table's metadata rather than displaying a result? Will it work with views?
Please suggest how to efficiently and automatically create documentation for a database in Hive.
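For reference, a minimal sketch of how the computed statistics can be read back once ANALYZE succeeds, assuming a table named tablename with a column named col (column-level DESCRIBE FORMATTED is available in Hive 0.14+):
-- compute column-level statistics; they are stored in the metastore,
-- not displayed by the ANALYZE statement itself
ANALYZE TABLE tablename COMPUTE STATISTICS FOR COLUMNS;
-- read the stats back; the output includes min, max, num_nulls,
-- distinct_count, etc., depending on the column type
DESCRIBE FORMATTED tablename col;
So the stats are written to the metastore, and a separate DESCRIBE is needed to display them.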

Hive: How do I join with a between dates condition?

I have table of items:
| id | dateTimeUTC | color |
+----+------------------+-------+
| 1 | 1/1/2001 1:11:11 | Red |
+----+------------------+-------+
| 2 | 2/2/2002 2:22:22 | Blue |
+----+------------------+-------+
It contains items, each with a dateTime. I also have an events table:
| eventID | startDate | endDate |
+---------+-------------------+------------------+
| 1 | 1/1/2001 1:11:11 | 2/2/2002 2:22:22 |
+---------+-------------------+------------------+
| 2 | 3/3/2003 00:00:00 | 3/3/2003 1:11:11 |
+---------+-------------------+------------------+
I want to join the two, selecting rows where the dateTimeUTC of the items table falls between the start and end date of the events table. Doing this in SQL is pretty standard, but not so much in HQL: Hive doesn't let you have anything but an "=" in the join clause (link to Hive info here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins). There was a question about a similar situation here before, but it's been four years since then and I hoped a solution had appeared.
Any tips on how to make this happen?
I think you have the dates stored as strings in your tables. If so, use the following, which converts the dates into a standard format (note the pattern letters: mm is minutes and ss is seconds, while MM means months, so 'HH:MM:SS' would parse incorrectly):
select *
from items_x, items_date
where unix_timestamp(dateTimeUTC, 'dd/MM/yyyy HH:mm:ss')
      between unix_timestamp(startDate, 'dd/MM/yyyy HH:mm:ss')
          and unix_timestamp(endDate, 'dd/MM/yyyy HH:mm:ss');
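Hive also accepts an explicit cross join with the non-equi condition pushed into the WHERE clause; a sketch of the same query in that style, assuming the tables are named items and events:
select i.id, i.color, e.eventID
from items i
cross join events e
where unix_timestamp(i.dateTimeUTC, 'dd/MM/yyyy HH:mm:ss')
      between unix_timestamp(e.startDate, 'dd/MM/yyyy HH:mm:ss')
          and unix_timestamp(e.endDate, 'dd/MM/yyyy HH:mm:ss');
Either way, Hive materializes the full cross product before filtering, so this can get expensive when both tables are large.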

(Nested?) Select statement with MAX and WHERE clause

I'm racking my brain over a set of data in order to generate a report from an Oracle DB.
The data is in two tables:
SUPPLY
DEVICE
There is only one column that links the two tables:
SUPPLY.DEVICE_ID
DEVICE.ID
In SUPPLY, there are these data:
| DEVICE_ID | COLOR_TYPE | SERIAL | UNINSTALL_DATE |
|----------- |------------ |-------------- |--------------------- |
| 1232 | 1 | CAP857496 | 08/11/2016,19:10:50 |
| 5263 | 2 | CAP57421 | 07/11/2016,11:20:00 |
| 758 | 3 | CBO753421869 | 07/11/2016,04:25:00 |
| 758 | 4 | CC9876543 | 06/11/2016,11:40:00 |
| 8575 | 4 | CVF75421 | 05/11/2016,23:59:00 |
| 758 | 4 | CAP67543 | 30/09/2016,11:00:00 |
In DEVICE, there are columns that I've to select all (more or less), but each row is unique.
What I need to achieve is:
for each SUPPLY.DEVICE_ID and SUPPLY.COLOR_TYPE, I need the most recent ROW -> MAX(UNINSTALL_DATE)
JOINED with
more or less all the columns in DEVICE.
At the end I should have something like this:
| ACCOUNT_CODE | MODEL | DEVICE.SERIAL | DEVICE_ID | COLOR_TYPE | SUPPLY.SERIAL | UNINSTALL_DATE |
|-------------- |------- |--------------- |----------- |------------ |--------------- |--------------------- |
| BUSTO | MS410 | LM753 | 1232 | 1 | CAP857496 | 08/11/2016,19:10:50 |
| MACCHI | MX310 | XC876 | 5263 | 2 | CAP57421 | 07/11/2016,11:20:00 |
| ASL_COMO | MX711 | AB123 | 758 | 3 | CBO753421869 | 07/11/2016,04:25:00 |
| ASL_COMO | MX711 | AB123 | 758 | 4 | CC9876543 | 06/11/2016,11:40:00 |
| ASL_VARESE | X950 | DE8745 | 8575 | 4 | CVF75421 | 05/11/2016,23:59:00 |
So far, using a nested select like:
SELECT DEVICE_ID, COLOR_TYPE, SERIAL, UNINSTALL_DATE FROM
(SELECT DEVICE_ID, COLOR_TYPE, SERIAL, UNINSTALL_DATE
FROM SUPPLY WHERE DEVICE_ID = '123456' ORDER BY UNINSTALL_DATE DESC)
WHERE ROWNUM <= 1
I managed to get the highest value in the UNINSTALL_DATE column, after trying MAX(UNINSTALL_DATE) and HIGHEST(UNINSTALL_DATE) without success.
I tried also:
SELECT SUPPLY.DEVICE_ID, SUPPLY.COLOR_TYPE, ....
FROM SUPPLY,DEVICE WHERE SUPPLY.DEVICE_ID = DEVICE.ID
and it works, but it gives me ALL the items; basically it's a merge of the two tables.
When I try to narrow the data selected, I get errors or an empty result.
I'm starting to think it's not possible to obtain this data, and I'm about to export everything to Excel and work from there, but I hope someone can help me before I give up.
Thank you in advance.
for each SUPPLY.DEVICE_ID and SUPPLY.COLOR_TYPE, I need the most recent ROW -> MAX(UNINSTALL_DATE)
Use the ROW_NUMBER function in this way:
SELECT s.*,
row_number() OVER (
PARTITION BY DEVICE_ID, COLOR_TYPE
ORDER BY UNINSTALL_DATE DESC
) As RN
FROM SUPPLY s
This query marks the most recent rows with RN = 1.
JOINED with more or less all the columns in DEVICE.
Just join the above query to the DEVICE table:
SELECT d.*,
x.COLOR_TYPE,
x.SERIAL,
x.UNINSTALL_DATE
FROM (
SELECT s.*,
row_number() OVER (
PARTITION BY DEVICE_ID, COLOR_TYPE
ORDER BY UNINSTALL_DATE DESC
) As RN
FROM SUPPLY s
) x
JOIN DEVICE d
ON d.ID = x.DEVICE_ID AND x.RN = 1
OK - so you could group by device_id, color_type and select max(uninstall_date) as well, and join to the other table. But you would miss the serial value for the most recent row (for each combination of device_id, color_type).
There are a few ways to fix that. Your attempt with rownum was close, but the problem is that you need to order within each "group" (by device_id, color_type) and get the first row from each group. I am sure someone will post a solution along those lines, using either row_number() or rank() or perhaps the analytic version of max(uninstall_date).
When you just need the "top" row from each group, you can use keep (dense_rank first/last) - which may be slightly more efficient - like so:
select device_id, color_type,
max(serial) keep (dense_rank last order by uninstall_date) as serial,
max(uninstall_date) as uninstall_date
from supply
group by device_id, color_type
;
and then join to the other table. NOTE: dense_rank last will pick up the row OR ROWS with the most recent (max) date for each group. If there are ties, that is more than one row; the serial will then be the max (in lexicographical order) among those rows with the most recent date. You can also select min, or add some order so you pick a specific one (you didn't discuss this possibility).
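For completeness, a sketch of that join, assuming the SUPPLY.DEVICE_ID = DEVICE.ID relationship described in the question:
select d.*, s.color_type, s.serial, s.uninstall_date
from (
    select device_id, color_type,
           max(serial) keep (dense_rank last order by uninstall_date) as serial,
           max(uninstall_date) as uninstall_date
    from supply
    group by device_id, color_type
) s
join device d
  on d.id = s.device_id;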
SELECT
d.ACCOUNT_CODE, d.DNS_HOST_NAME,d.IP_ADDRESS,d.MODEL_NAME,d.OVERRIDE_SERIAL_NUMBER,d.SERIAL_NUMBER,
s.COLOR, s.SERIAL_NUMBER, s.UNINSTALL_TIME
FROM (
SELECT s.DEVICE_ID, s.LAST_LEVEL_READ, s.SERIAL_NUMBER,TRUNC(s.UNINSTALL_TIME), row_number()
OVER (
PARTITION BY DEVICE_ID, COLOR
ORDER BY UNINSTALL_TIME DESC
) As RN
FROM SUPPLY s
WHERE s.UNINSTALL_TIME IS NOT NULL AND s.SERIAL_NUMBER IS NOT NULL
)
JOIN DEVICE d
ON d.ID = s.DEVICE_ID AND s.RN=1;
#krokodilko: thank you very much for your help. The first query works. I modified it to remove junk, put in the real column names I need (yesterday evening I had no access to the DB), and fetch only the data I need.
Unfortunately, when I join the two tables as you suggested I get error:
ORA-00904: "S"."RN": invalid identifier
00904. 00000 - "%s: invalid identifier"
If I remove the s. before RN, the ORA-00904 moves back to s.DEVICE_ID.
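The ORA-00904 comes from the inline view never being given the alias s that the outer query references. A sketch of the corrected query, based on the column names in the attempt above (SUPPLY.* is used so that COLOR and UNINSTALL_TIME are visible to the outer query):
SELECT d.ACCOUNT_CODE, d.DNS_HOST_NAME, d.IP_ADDRESS, d.MODEL_NAME,
       d.OVERRIDE_SERIAL_NUMBER, d.SERIAL_NUMBER,
       s.COLOR, s.SERIAL_NUMBER, s.UNINSTALL_TIME
FROM (
    SELECT SUPPLY.*,
           row_number() OVER (
               PARTITION BY DEVICE_ID, COLOR
               ORDER BY UNINSTALL_TIME DESC
           ) AS RN
    FROM SUPPLY
    WHERE UNINSTALL_TIME IS NOT NULL AND SERIAL_NUMBER IS NOT NULL
) s  -- this alias was missing
JOIN DEVICE d
  ON d.ID = s.DEVICE_ID AND s.RN = 1;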

Column to comma separated value in Hive

It's been asked and answered for SQL (Convert multiple rows into one with comma as separator), would any of the approaches mentioned work in Hive, e.g. to go from this:
+------+------+
| Col1 | Col2 |
+------+------+
| a | 1 |
| a | 5 |
| a | 6 |
| b | 2 |
| b | 6 |
+------+------+
to this:
+------+-------+
| Col1 | Col2 |
+------+-------+
| a | 1,5,6 |
| b | 2,6 |
+------+-------+
The aggregate function collect_set can achieve what you are trying to get (see the Hive documentation on built-in aggregate functions). So you can write a query like:
SELECT Col1, collect_set(Col2)
FROM your_table
GROUP BY Col1;
However, there is one striking difference between MySQL's GROUP_CONCAT and Hive's collect_set: while GROUP_CONCAT retains duplicates in the resulting list, collect_set removes any duplicates occurring in the array. In the example you have shown there are no repeating values of Col2 within a group, so you can go ahead and use it.
And there is collect_list, which keeps the full list (with duplicates).
Try this
SELECT Col1, concat_ws(',', collect_set(Col2)) as col2
FROM your_table
GROUP BY Col1;
See the apache.org documentation.
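If you need to keep duplicates and also want a deterministic order, collect_list can be combined with sort_array; a sketch, assuming Col2 is numeric (the cast is needed because concat_ws expects array<string>, and note the resulting sort is lexicographic):
SELECT Col1,
       concat_ws(',', sort_array(collect_list(cast(Col2 AS string)))) AS col2
FROM your_table
GROUP BY Col1;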

How to repeat query in Oracle Forms upon dynamically changing ORDER_BY clause?

I have an Oracle Forms 6i form with a data block that consists of several columns.
------------------------------------------------------------------------------
| FIRST_NAME | LAST_NAME | DEPARTMENT | BIRTH_DATE | JOIN_DATE | RETIRE_DATE |
------------------------------------------------------------------------------
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
------------------------------------------------------------------------------
The user can press F7 to enter Query Mode (for example, he/she types JOH% in the LAST_NAME field and H% in the DEPARTMENT field), then F8 to execute the query and see the results. In this example, all employees whose last name starts with JOH and who work in any department starting with H will be listed. Here is a sample output of that query:
------------------------------------------------------------------------------
| FIRST_NAME | LAST_NAME | DEPARTMENT | BIRTH_DATE | JOIN_DATE | RETIRE_DATE |
------------------------------------------------------------------------------
| MIKE | JOHN | HUMAN RES. | 05-MAY-82 | 02-FEB-95 | |
| BEN | JOHNATHAN | HOUSING | 23-APR-76 | 16-AUG-98 | |
| SMITH | JOHN | HOUSING | 11-DEC-78 | 30-JUL-91 | |
| | | | | | |
------------------------------------------------------------------------------
I then added a small button on top of each column to allow the user to sort the data by the desired column via a WHEN-BUTTON-PRESSED trigger:
set_block_property('dept', order_by, 'first_name desc');
The good news is that the ORDER_BY does change. The bad news is that the user never notices the change, because he/she needs to perform another query and execute it to see the output ordered by the selected column. In other words, the user will only notice the change on the next query he/she executes.
I tried to automatically execute the query upon changing the ORDER_BY clause like this:
set_block_property('dept', order_by, 'first_name desc');
go_block('EMPLOYEE');
do_key('EXECUTE_QUERY');
/* EXECUTE_QUERY -- same thing */
but what happens is that all data from the table is selected, ignoring the criteria the user initially set in query mode.
I also searched for a solution to this problem, and most of the results deal with SYSTEM.LAST_QUERY and DEFAULT_WHERE. The problem is that LAST_QUERY can refer to a different block in a different form, one that is not valid for the currently displayed data block.
How can I do the following in just one button press:
1- Change the ORDER_BY clause of the currently active datablock
and: 2- Execute the last query that the user has executed, using the same criteria that was set?
Any help will be highly appreciated.
You can get the last query of the block with the GET_BLOCK_PROPERTY built-in function:
GET_BLOCK_PROPERTY('EMPLOYEE', LAST_QUERY);
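Building on that, one way to combine LAST_QUERY with DEFAULT_WHERE is to carve the WHERE clause out of the block's last query and re-apply it before re-executing. A rough sketch for the WHEN-BUTTON-PRESSED trigger, assuming the block is named EMPLOYEE and treating the string handling as hypothetical helper logic rather than production code:
DECLARE
    v_last  VARCHAR2(4000);
    v_where VARCHAR2(4000);
    v_pos   PLS_INTEGER;
BEGIN
    -- full text of the last query executed against this block
    v_last := GET_BLOCK_PROPERTY('EMPLOYEE', LAST_QUERY);
    v_pos  := INSTR(UPPER(v_last), ' WHERE ');
    IF v_pos > 0 THEN
        v_where := SUBSTR(v_last, v_pos + 7);
        -- drop any trailing ORDER BY; it is replaced below
        v_pos := INSTR(UPPER(v_where), ' ORDER BY ');
        IF v_pos > 0 THEN
            v_where := SUBSTR(v_where, 1, v_pos - 1);
        END IF;
        SET_BLOCK_PROPERTY('EMPLOYEE', DEFAULT_WHERE, v_where);
    END IF;
    SET_BLOCK_PROPERTY('EMPLOYEE', ORDER_BY, 'first_name desc');
    GO_BLOCK('EMPLOYEE');
    EXECUTE_QUERY;
END;
Note that DEFAULT_WHERE set this way persists for subsequent queries, so you may want to reset it afterwards.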
Another option is to provide separate search field(s) on the form, instead of using the QBE functionality.
