Is there a Hive equivalent of SQL "NOT LIKE" syntax?

While Hive supports positive LIKE queries, e.g.:
select * from table_name where column_name like 'root~%';
it does not appear to support negative LIKE queries, e.g.:
select * from table_name where column_name not like 'root~%';
Does anyone know an equivalent solution that Hive does support?

Try this:
Where Not (Col_Name like '%whatever%')
It also works with RLIKE:
Where Not (Col_Name rlike '.*whatever.*')

NOT LIKE has been supported since Hive 0.8.0; see HIVE-1740 on JIRA:
https://issues.apache.org/jira/browse/HIVE-1740

In SQL:
select * from table_name where column_name not like '%something%';
In Hive:
select * from table_name where not (column_name like '%something%');

Check out https://cwiki.apache.org/confluence/display/Hive/LanguageManual if you haven't; I reference it all the time when I'm writing queries for Hive.
I haven't done anything where I'm trying to match part of a word, but you might check out RLIKE (in this section: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#Relational_Operators).
This is probably a bit of a hack job, but you could do a subquery where you check whether each row matches the positive pattern, and use a CASE (http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF#Conditional_Functions) to produce a known value for the main query to check against; see the sketch after this answer.
Another option is to write a UDF which does the checking.
I'm just brainstorming while sitting at home with no access to Hive, so I may be missing something obvious. :)
Hope that helps in some fashion or another. \^_^/
EDIT: Adding an additional method from my comment below.
For your provided example: colName RLIKE '[^r][^o][^o][^t]~\w'. That may not be the optimal regex, but it's something to look into instead of subqueries.
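A rough sketch of that CASE-based workaround (assuming a hypothetical table my_table with a string column col_name; untested, so treat it as an illustration rather than a definitive solution):
-- Flag rows that match the positive pattern, then keep only the non-matches.
SELECT t.col_name
FROM (
    SELECT col_name,
           CASE WHEN col_name LIKE 'root~%' THEN 1 ELSE 0 END AS is_match
    FROM my_table
) t
WHERE t.is_match = 0;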

Using regexp_extract works as well:
select * from table_name where regexp_extract(my_column, 'myword', 0) = '';

Actually, you can write it like this:
select * from table_name where not column_name like 'root~%';

In Impala you can use != for not-equals:
columnname != value
Note that this is a plain inequality test, not a pattern match like NOT LIKE.

As @Sanjiv answered, Hive does support NOT LIKE:
0: hive> select * from dwtmp.load_test;
+--------------------+----------------------+
| load_test.item_id  | load_test.item_name  |
+--------------------+----------------------+
| 18282782           | NW                   |
| 1929SEGH2          | BSTN                 |
| 172u8562           | PLA                  |
| 121232             | JHK                  |
| 3443453            | AG                   |
| 198WS238           | AGS                  |
+--------------------+----------------------+
6 rows selected (0.224 seconds)
0: hive> select * from dwtmp.load_test where item_name like '%ST%';
+--------------------+----------------------+
| load_test.item_id  | load_test.item_name  |
+--------------------+----------------------+
| 1929SEGH2          | BSTN                 |
+--------------------+----------------------+
1 row selected (0.271 seconds)
0: hive> select * from dwtmp.load_test where item_name not like '%ST%';
+--------------------+----------------------+
| load_test.item_id  | load_test.item_name  |
+--------------------+----------------------+
| 18282782           | NW                   |
| 172u8562           | PLA                  |
| 121232             | JHK                  |
| 3443453            | AG                   |
| 198WS238           | AGS                  |
+--------------------+----------------------+
5 rows selected (0.247 seconds)

Related

How to reset textinputformat.record.delimiter to its default value within hive cli / beeline?

Setting textinputformat.record.delimiter to a non-default value is useful for loading multi-row text, as shown in the demo below.
However, I'm failing to set this parameter back to its default value without exiting the CLI and reopening it.
None of the following options worked (nor some other attempts):
set textinputformat.record.delimiter='\n';
set textinputformat.record.delimiter='\r';
set textinputformat.record.delimiter='\r\n';
set textinputformat.record.delimiter='
';
reset;
Any thoughts?
Thanks
Demo
create table mytable (mycol string);
insert into mytable select concat('Hello',unhex('A'),'world');
select concat('>>>',mycol,'<<<') as mycol from mytable;
The newline is interpreted as a record delimiter, causing two records to be inserted:
+-------------+
| mycol       |
+-------------+
| >>>Hello<<< |
| >>>world<<< |
+-------------+
set textinputformat.record.delimiter='\0';
truncate table mytable;
insert into mytable select concat('Hello',unhex('A'),'world');
select concat('>>>',mycol,'<<<') as mycol from mytable;
The whole text was inserted as a single record:
+----------+
| mycol    |
+----------+
| >>>Hello |
| world    |
| <<<      |
+----------+
Trying to change the delimiter back to newline
set textinputformat.record.delimiter='\n';
truncate table mytable;
insert into mytable select concat('Hello',unhex('A'),'world');
select concat('>>>',mycol,'<<<') as mycol from mytable;
I still get the same results:
+----------+
| mycol    |
+----------+
| >>>Hello |
| world    |
| <<<      |
+----------+
Have you checked the textinputformat.record.delimiter variable's state? Was it really changed? You can check by calling set textinputformat.record.delimiter without any value. If it was changed but doesn't work, you could definitely create an issue in the issue tracker. As a workaround for setting the delimiter parameter back to its default value, you could try the RESET command. It resets ALL properties to their default values, though, so this solution may be unacceptable for your case.
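For example (a sketch; exact behavior varies by Hive version, and the RESET side effects noted above apply):
-- print the current value without changing it
set textinputformat.record.delimiter;
-- reset ALL session properties, including the delimiter, to their defaults
reset;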
Use Unicode alt+A or \u0001 as the delimiter.

Hive: How do I join with a between dates condition?

I have a table of items:
| id | dateTimeUTC      | color |
+----+------------------+-------+
| 1  | 1/1/2001 1:11:11 | Red   |
+----+------------------+-------+
| 2  | 2/2/2002 2:22:22 | Blue  |
+----+------------------+-------+
It contains some events with a dateTime in it. I also have an events table:
| eventID | startDate         | endDate          |
+---------+-------------------+------------------+
| 1       | 1/1/2001 1:11:11  | 2/2/2002 2:22:22 |
+---------+-------------------+------------------+
| 2       | 3/3/2003 00:00:00 | 3/3/2003 1:11:11 |
+---------+-------------------+------------------+
I want to join the two, getting rows where the dateTimeUTC of the items table is between the start and end date of the events table. Doing this in SQL is pretty standard, but HQL not so much: Hive doesn't let you have anything but an "=" in the join clause (link to the Hive info here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins). There was a question about a similar situation before, but it's been 4 years since then and I'm hoping there is a solution now.
Any tips on how to make this happen?
I think you have the dates stored as strings in the tables. If yes, use the following, which converts the dates into a standard format:
select * from items_x, items_date where unix_timestamp(dateTimeUTC, 'dd/MM/yyyy HH:mm:ss') between unix_timestamp(startDate, 'dd/MM/yyyy HH:mm:ss') and unix_timestamp(endDate, 'dd/MM/yyyy HH:mm:ss');
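A hedged alternative sketch using explicit CROSS JOIN syntax with the range predicate in WHERE (the items/events table and column names are taken from the question; the 'd/M/yyyy H:mm:ss' pattern is an assumption based on the single-digit days, months, and hours in the sample data):
SELECT i.id, i.color, e.eventID
FROM items i
CROSS JOIN events e
WHERE unix_timestamp(i.dateTimeUTC, 'd/M/yyyy H:mm:ss')
      BETWEEN unix_timestamp(e.startDate, 'd/M/yyyy H:mm:ss')
          AND unix_timestamp(e.endDate, 'd/M/yyyy H:mm:ss');
Be aware that a cross join pairs every row before filtering, which can be expensive on large tables.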

(Nested?) Select statement with MAX and WHERE clause

I'm racking my brain over a set of data in order to generate a report from an Oracle DB.
Data are in two tables:
SUPPLY
DEVICE
There is only one column that links the two tables:
SUPPLY.DEVICE_ID
DEVICE.ID
In SUPPLY, there are these data:
| DEVICE_ID | COLOR_TYPE | SERIAL       | UNINSTALL_DATE      |
|-----------|------------|--------------|---------------------|
| 1232      | 1          | CAP857496    | 08/11/2016,19:10:50 |
| 5263      | 2          | CAP57421     | 07/11/2016,11:20:00 |
| 758       | 3          | CBO753421869 | 07/11/2016,04:25:00 |
| 758       | 4          | CC9876543    | 06/11/2016,11:40:00 |
| 8575      | 4          | CVF75421     | 05/11/2016,23:59:00 |
| 758       | 4          | CAP67543     | 30/09/2016,11:00:00 |
In DEVICE, there are columns that I have to select all of (more or less), and each row is unique.
What I need to achieve is:
for each SUPPLY.DEVICE_ID and SUPPLY.COLOR_TYPE, I need the most recent row -> MAX(UNINSTALL_DATE),
JOINED with
more or less all the columns in DEVICE.
At the end I should have something like this:
| ACCOUNT_CODE | MODEL | DEVICE.SERIAL | DEVICE_ID | COLOR_TYPE | SUPPLY.SERIAL | UNINSTALL_DATE      |
|--------------|-------|---------------|-----------|------------|---------------|---------------------|
| BUSTO        | MS410 | LM753         | 1232      | 1          | CAP857496     | 08/11/2016,19:10:50 |
| MACCHI       | MX310 | XC876         | 5263      | 2          | CAP57421      | 07/11/2016,11:20:00 |
| ASL_COMO     | MX711 | AB123         | 758       | 3          | CBO753421869  | 07/11/2016,04:25:00 |
| ASL_COMO     | MX711 | AB123         | 758       | 4          | CC9876543     | 06/11/2016,11:40:00 |
| ASL_VARESE   | X950  | DE8745        | 8575      | 4          | CVF75421      | 05/11/2016,23:59:00 |
So far, using a nested select like:
SELECT DEVICE_ID, COLOR_TYPE, SERIAL, UNINSTALL_DATE FROM
    (SELECT DEVICE_ID, COLOR_TYPE, SERIAL, UNINSTALL_DATE
     FROM SUPPLY WHERE DEVICE_ID = '123456' ORDER BY UNINSTALL_DATE DESC)
WHERE ROWNUM <= 1
I managed to get the highest value in the UNINSTALL_DATE column, after trying MAX(UNINSTALL_DATE) and HIGHEST(UNINSTALL_DATE).
I also tried:
SELECT SUPPLY.DEVICE_ID, SUPPLY.COLOR_TYPE, ....
FROM SUPPLY, DEVICE WHERE SUPPLY.DEVICE_ID = DEVICE.ID
and it works, but it gives me ALL the items; basically it's a merge of the two tables.
When I try to narrow down the selected data, I get errors or an empty result.
I'm starting to wonder whether it's possible to obtain this data at all, and I'm about to export the data to Excel and work from there, but I hope someone can help me before I give up...
Thank you in advance.
for each SUPPLY.DEVICE_ID and SUPPLY.COLOR_TYPE, I need the most recent ROW -> MAX(UNINSTALL_DATE)
Use the ROW_NUMBER function in this way:
SELECT s.*,
       row_number() OVER (
           PARTITION BY DEVICE_ID, COLOR_TYPE
           ORDER BY UNINSTALL_DATE DESC
       ) AS RN
FROM SUPPLY s
This query marks the most recent rows with RN = 1.
JOINED with more or less all the columns in DEVICE.
Just join the above query to the DEVICE table:
SELECT d.*,
       x.COLOR_TYPE,
       x.SERIAL,
       x.UNINSTALL_DATE
FROM (
    SELECT s.*,
           row_number() OVER (
               PARTITION BY DEVICE_ID, COLOR_TYPE
               ORDER BY UNINSTALL_DATE DESC
           ) AS RN
    FROM SUPPLY s
) x
JOIN DEVICE d
  ON d.ID = x.DEVICE_ID AND x.RN = 1
OK - so you could group by device_id, color_type and select max(uninstall_date) as well, and join to the other table. But you would miss the serial value for the most recent row (for each combination of device_id, color_type).
There are a few ways to fix that. Your attempt with rownum was close, but the problem is that you need to order within each "group" (by device_id, color_type) and get the first row from each group. I am sure someone will post a solution along those lines, using either row_number() or rank() or perhaps the analytic version of max(uninstall_date).
When you just need the "top" row from each group, you can use keep (dense_rank first/last) - which may be slightly more efficient - like so:
select device_id, color_type,
max(serial) keep (dense_rank last order by uninstall_date) as serial,
max(uninstall_date) as uninstall_date
from supply
group by device_id, color_type
;
and then join to the other table. NOTE: dense_rank last will pick up the row OR ROWS with the most recent (max) date for each group. If there are ties, that is more than one row; the serial will then be the max (in lexicographical order) among those rows with the most recent date. You can also select min, or add some order so you pick a specific one (you didn't discuss this possibility).
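For completeness, a minimal sketch of that final join, reusing the aggregate above and assuming (per the question) that DEVICE.ID matches SUPPLY.DEVICE_ID:
SELECT d.*, x.color_type, x.serial, x.uninstall_date
FROM (
    SELECT device_id, color_type,
           max(serial) keep (dense_rank last order by uninstall_date) as serial,
           max(uninstall_date) as uninstall_date
    FROM supply
    GROUP BY device_id, color_type
) x
JOIN device d ON d.id = x.device_id;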
@krokodilko: thank you very much for your help. The first query works. I modified it in order to remove junk, put in the real column names I need (yesterday evening I had no access to the DB), and get only the data I need. Unfortunately, when I join the two tables as you suggested, I get an error. My modified query:
SELECT
    d.ACCOUNT_CODE, d.DNS_HOST_NAME, d.IP_ADDRESS, d.MODEL_NAME, d.OVERRIDE_SERIAL_NUMBER, d.SERIAL_NUMBER,
    s.COLOR, s.SERIAL_NUMBER, s.UNINSTALL_TIME
FROM (
    SELECT s.DEVICE_ID, s.LAST_LEVEL_READ, s.SERIAL_NUMBER, TRUNC(s.UNINSTALL_TIME),
           row_number() OVER (
               PARTITION BY DEVICE_ID, COLOR
               ORDER BY UNINSTALL_TIME DESC
           ) AS RN
    FROM SUPPLY s
    WHERE s.UNINSTALL_TIME IS NOT NULL AND s.SERIAL_NUMBER IS NOT NULL
)
JOIN DEVICE d
  ON d.ID = s.DEVICE_ID AND s.RN = 1;
The error:
ORA-00904: "S"."RN": invalid identifier
00904. 00000 - "%s: invalid identifier"
If I remove s. before RN, the ORA-00904 moves back to s.DEVICE_ID.
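The ORA-00904 here is most likely because the inline view has no alias, so s is not visible in the outer query. A hedged sketch of a corrected version (it aliases the subquery as s, gives TRUNC(UNINSTALL_TIME) a column alias, and adds COLOR to the inner select so the outer references resolve; column names are taken from the query above):
SELECT
    d.ACCOUNT_CODE, d.DNS_HOST_NAME, d.IP_ADDRESS, d.MODEL_NAME, d.OVERRIDE_SERIAL_NUMBER, d.SERIAL_NUMBER,
    s.COLOR, s.SERIAL_NUMBER, s.UNINSTALL_TIME
FROM (
    SELECT DEVICE_ID, LAST_LEVEL_READ, SERIAL_NUMBER, COLOR,
           TRUNC(UNINSTALL_TIME) AS UNINSTALL_TIME,
           row_number() OVER (
               PARTITION BY DEVICE_ID, COLOR
               ORDER BY UNINSTALL_TIME DESC
           ) AS RN
    FROM SUPPLY
    WHERE UNINSTALL_TIME IS NOT NULL AND SERIAL_NUMBER IS NOT NULL
) s
JOIN DEVICE d
  ON d.ID = s.DEVICE_ID AND s.RN = 1;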

Column to comma separated value in Hive

It's been asked and answered for SQL (Convert multiple rows into one with comma as separator); would any of the approaches mentioned there work in Hive, e.g. to go from this:
+------+------+
| Col1 | Col2 |
+------+------+
| a    | 1    |
| a    | 5    |
| a    | 6    |
| b    | 2    |
| b    | 6    |
+------+------+
to this:
+------+-------+
| Col1 | Col2  |
+------+-------+
| a    | 1,5,6 |
| b    | 2,6   |
+------+-------+
The aggregate function collect_set can achieve what you are trying to get. Here is the documentation. You can write a query like:
SELECT Col1, collect_set(Col2)
FROM your_table
GROUP BY Col1;
However, there is one striking difference between MySQL's GROUP_CONCAT and Hive's collect_set: while GROUP_CONCAT retains duplicates in the resulting list, collect_set removes any duplicates occurring in the array. In your example there are no repeating values of Col2 within a group, so you can go ahead and use it.
And there is collect_list, which keeps the full list (with duplicates).
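A small sketch contrasting the two (hedged: collect_set and collect_list return arrays, and concat_ws expects strings, so the cast is needed if Col2 is numeric; your_table is the placeholder name used above):
SELECT Col1,
       concat_ws(',', collect_set(cast(Col2 AS string)))  AS dedup_csv, -- duplicates removed
       concat_ws(',', collect_list(cast(Col2 AS string))) AS full_csv   -- duplicates kept
FROM your_table
GROUP BY Col1;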
Try this
SELECT Col1, concat_ws(',', collect_set(Col2)) as col2
FROM your_table
GROUP BY Col1;
apache.org documentation

Oracle explain plan over simple select performs multiple hash joins when multiple columns are indexed in a table

I am currently running into an issue with my Oracle instance. I have two simple select statements:
select * from dog_vets
and
select * from dog_statuses
and the following fiddle
My explain plan on dog_vets is as follows:
ID | Operation              | Name
 0 | Select Statement       |
 1 | Table Access Full Scan | dog_vets
My explain plan on dog_statuses is as follows:
ID | Operation             | Name                  | Rows | Bytes | Cost   | Time
 0 | Select Statement      |                       | 20G  | 500M  | 100000 | 999:99:17
 1 | View                  | index$_join$_001      | 20G  | 500M  | 100000 | 999:99:17
 2 | Hash Join             |                       |      |       |        |
 3 | Hash Join             |                       |      |       |        |
 4 | Index Fast Full Scan  | dog_statuses_check_up | 20G  | 500M  | 100000 | 32:15:00
 5 | Index Fast Full Scan  | dog_statuses_sick     | 20G  | 500M  | 100000 | 35:19:00
To get this type of output, execute the following statement:
explain plan for
select * from dog_vets;
OR
explain plan for
select * from dog_statuses;
and then
select * from table(dbms_xplan.display);
Now my question is: why do multiple indexes cause a view (materialized, I assume) to appear in the plans above, and further, what type of performance hit am I suffering with this type of query? As it stands now dog_vets has ~300 million records and dog_statuses has about 500 million. I have yet to get select * from dog_statuses to return in under 10 hours, primarily because the query dies before it completes.
DDL
In case the SQL fiddle dies:
create table dog_vets
(
    name varchar2(50),
    founded timestamp,
    staff_count number
);

create table dog_statuses
(
    check_up timestamp,
    sick varchar2(1)
);

create index dog_vet_name on dog_vets(name);

create index dog_status_check_up on dog_statuses(check_up);

create index dog_status_sick on dog_statuses(sick);
You could try telling the optimizer to ignore the indexes:
SELECT /*+ NO_INDEX(dog_statuses) */ *
FROM dog_statuses;
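To verify whether the hint changes the plan, you can rerun the explain-plan steps from the question (a sketch reusing the same dbms_xplan call):
explain plan for
select /*+ NO_INDEX(dog_statuses) */ * from dog_statuses;
select * from table(dbms_xplan.display);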
